Posted to apache-bugdb@apache.org by Phillip Ezolt <ez...@perf.zko.dec.com> on 1999/02/17 22:30:58 UTC

os-linux/3911: Under high load, server hangs in "flock or fcntl".

>Number:         3911
>Category:       os-linux
>Synopsis:       Under high load, server hangs in "flock or fcntl".
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    apache
>State:          open
>Class:          sw-bug
>Submitter-Id:   apache
>Arrival-Date:   Wed Feb 17 13:40:01 PST 1999
>Last-Modified:
>Originator:     ezolt@perf.zko.dec.com
>Organization:
apache
>Release:        1.3.4
>Environment:
Linux crappy.zko.dec.com 2.2.1 #5 Fri Feb 12 09:07:00 EST 1999 i686 unknown
gcc version egcs-2.91.60 19981201 (egcs-1.1.1 release)
OS version: Redhat 5.2
Intel PII/266 with FDDI interface card.

The kernel was compiled with the following values changed in
/usr/src/linux/include/net/tcp.h:

#define TCP_HTABLE_SIZE         2048    /* was 512 */
#define TCP_LHTABLE_SIZE        128     /* was 32 */
#define TCP_BHTABLE_SIZE        2048    /* was 512 */

Everything is on a local filesystem.  The lockfiles are NOT on NFS.
>Description:
I can reliably get the server to stop responding after significantly stressing
the system.  Initially, I had Apache compiled with flock serialization.  After
a while, a large number of the httpd processes were stuck in the following state:

#0  0x400d49c1 in flock ()
#1  0x805aaa9 in accept_mutex_on ()
#2  0x805d6a5 in child_main ()
#3  0x805dc68 in make_child ()
#4  0x805dfe1 in perform_idle_server_maintenance ()
#5  0x805e4e9 in standalone_main ()
#6  0x805ea7b in main () 

A few were in the following state (which is what they SHOULD be in):
#0  0x400de5c2 in __libc_accept ()
#1  0x805d7bc in child_main ()
#2  0x805dc68 in make_child ()
#3  0x805dd17 in startup_children ()
#4  0x805e328 in standalone_main ()
#5  0x805ea7b in main () 
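
For reference, the accept_mutex_on frame in the first trace is where a child
waits for its turn to call accept().  The following is a minimal sketch of
flock-based accept serialization, not the Apache source.  All sketch_* names
are hypothetical, and the sketch assumes that each child opens the lock file
itself, so that the children hold separate open file descriptions.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: open (creating if needed) the lock file.
 * Assumption: called once per child, after fork(). */
int sketch_lock_open(const char *path)
{
    assert(path != NULL);
    return open(path, O_CREAT | O_WRONLY, 0600);
}

/* Block until this process holds the exclusive lock.
 * Retries if a signal interrupts the wait.  Returns 0 or -1. */
int sketch_accept_mutex_on(int lock_fd)
{
    int ret;

    while ((ret = flock(lock_fd, LOCK_EX)) < 0 && errno == EINTR)
        continue;
    return ret;
}

/* Release the lock so another child can enter accept(). */
int sketch_accept_mutex_off(int lock_fd)
{
    return flock(lock_fd, LOCK_UN);
}

/* One serialized accept: lock, accept, unlock.
 * Returns the connected socket, or -1 on failure. */
int sketch_serialized_accept(int lock_fd, int listen_fd)
{
    int conn;

    if (sketch_accept_mutex_on(lock_fd) < 0)
        return -1;
    conn = accept(listen_fd, NULL, NULL);
    if (sketch_accept_mutex_off(lock_fd) < 0 && conn >= 0) {
        close(conn);
        return -1;
    }
    return conn;
}
```

In this scheme, exactly one child should sit in accept() while the others wait
in flock().  A hang would correspond to every child waiting in flock() with no
holder ever releasing the lock.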

When I tried to connect to the server (lynx http://127.0.0.1), it would
just hang.  Normally, the response would be instantaneous.

I recompiled Apache with fcntl serialization instead, and the same thing occurs.
This time the stack trace is:

#0  0x400d4974 in __libc_fcntl ()
#1  0x1 in ?? ()
#2  0x805d66d in child_main ()
#3  0x805dc30 in make_child ()
#4  0x805dcdf in startup_children ()
#5  0x805e2f0 in standalone_main ()
#6  0x805ea43 in main ()  
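
The fcntl variant of the same idea can be sketched as follows.  This is an
illustration under assumptions, not the Apache source: the sketch_* names are
hypothetical, and the lock is assumed to cover the whole lock file.  Unlike
flock locks, fcntl locks belong to the process, so they serialize separate
processes but not separate descriptors inside one process.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Apply a blocking whole-file lock operation of the given type
 * (F_WRLCK to acquire, F_UNLCK to release).  Retries if a signal
 * interrupts the wait.  Returns 0 on success, -1 on failure. */
static int sketch_fcntl_lock(int lock_fd, short type)
{
    struct flock lk;
    int ret;

    assert(lock_fd >= 0);
    memset(&lk, 0, sizeof(lk));
    lk.l_type = type;
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;               /* 0 means "to end of file" */

    while ((ret = fcntl(lock_fd, F_SETLKW, &lk)) < 0 && errno == EINTR)
        continue;
    return ret;
}

/* Acquire the accept lock; blocks while another process holds it. */
int sketch_fcntl_mutex_on(int lock_fd)
{
    return sketch_fcntl_lock(lock_fd, F_WRLCK);
}

/* Release the accept lock. */
int sketch_fcntl_mutex_off(int lock_fd)
{
    return sketch_fcntl_lock(lock_fd, F_UNLCK);
}
```

A child would call sketch_fcntl_mutex_on() before accept() and
sketch_fcntl_mutex_off() right after it returns.  The descriptor must be open
for writing for F_WRLCK to succeed.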

There appears to be some kind of race condition that occurs under very heavy load.

I am not sure whether it is a Linux, Apache, or even glibc bug, but I really
want to get a good result here.
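
One way to narrow this down in the fcntl build might be to ask the kernel
which process, if any, holds the lock while the server is hung.  The helper
below is a hypothetical diagnostic sketch (sketch_lock_holder is not an Apache
function).  It relies on F_GETLK, which only reports locks that would conflict
with the caller, so it has to be run from a process other than the suspected
holder.  It covers fcntl locks only; on Linux it would not report a lock taken
with flock().

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the PID of a process holding a conflicting fcntl lock on
 * the file, 0 if no other process holds one, or -1 on error. */
pid_t sketch_lock_holder(int lock_fd)
{
    struct flock lk;

    assert(lock_fd >= 0);
    memset(&lk, 0, sizeof(lk));
    lk.l_type = F_WRLCK;        /* ask: could we take a write lock? */
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;               /* whole file */

    if (fcntl(lock_fd, F_GETLK, &lk) < 0)
        return (pid_t)-1;
    if (lk.l_type == F_UNLCK)
        return 0;               /* nobody else holds the lock */
    return lk.l_pid;
}
```

If this returned 0 while children were blocked in fcntl(), that would point
away from a stuck lock holder inside Apache and toward the kernel or libc.  If
it returned a PID, a stack trace of that process would be the next thing to
look at.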

>How-To-Repeat:
The load is generated by SPECweb96.  The hang occurs when I try to push my
system above 60 ops/sec.  I don't have an easy way for an external site to
repeat it, but for the next week and a half it is all I will be working on, so
I can easily try out any patches that anyone may have.
>Fix:
None. 
>Audit-Trail:
>Unformatted: