You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Brandon Ehle <az...@yahoo.com> on 2002/12/10 15:03:54 UTC

First httpd deadlock fixed, but found another one

> 
>
>You didn't say what your tests do, so it's going to be hard for us to
>identify which process crashed :)  You are really the only one who can
>do that.
>  
>
My tests run through a import, checkout, status, status -u, & update (in 
that order), while steadily increasing the amount of files.  It does 
this once via ra_local, then it does the same thing over ra_dav and 
reports the difference in time.


The last deadlock that I posted seems to have gone away when I followed 
Justin Ererkrantz's advice and update to DB4.1.  With DB4.1 mod_dav_svn 
is much more stable and can handle much more load than DB4.0 could. 


I had previously thought that all the deadlocks were fixed, but I've 
uncovered another one.  Although subversion never failed any of the 
operations, httpd is deadlocking when attempting to shut it down.  This 
happened after runnning the import, checkout, status, status -u, & 
update commands on a working copy containing 8333 files.  The previous 
run where it threw 6666 files at mod_dav_svn worked fine and it also 
worked with 8333 files over ra_local.


Here is the backtrace of the threads in apache, note that this is the 
backtrace after some of the other threads have exited.


(gdb) thread apply all bt

Thread 4 (Thread 139282 (LWP 24116)):
#0  0x401f4ae9 in __memp_fopen_int () from /lib/libdb-4.1.so
#1  0x401f45b4 in __memp_fopen () from /lib/libdb-4.1.so
#2  0x401a1101 in __db_dbenv_setup () from /lib/libdb-4.1.so
#3  0x401b3bf6 in __db_dbopen () from /lib/libdb-4.1.so
#4  0x401b391f in __db_open () from /lib/libdb-4.1.so
#5  0x40408ce5 in svn_fs__open_nodes_table () from 
/root/rpms/build/subversion/subversion/libsvn_fs/.libs/libsvn_fs-1.so.0
#6  0x4040f8b3 in svn_fs_open_berkeley () from 
/root/rpms/build/subversion/subversion/libsvn_fs/.libs/libsvn_fs-1.so.0
#7  0x403fcb77 in get_repos () from 
/root/rpms/build/subversion/subversion/libsvn_repos/.libs/libsvn_repos-1.so.0
#8  0x403fccd5 in svn_repos_open () from 
/root/rpms/build/subversion/subversion/libsvn_repos/.libs/libsvn_repos-1.so.0
#9  0x403e6b9a in dav_svn_get_resource () from 
/etc/httpd/modules/mod_dav_svn.so
#10 0x4031aa4a in dav_get_resource () from /etc/httpd/modules/mod_dav.so
#11 0x4031aeaf in dav_method_put () from /etc/httpd/modules/mod_dav.so
#12 0x080684d5 in ap_run_handler ()
#13 0x08068aed in ap_invoke_handler ()
#14 0x080636a6 in ap_process_request ()
#15 0x0805ecac in ap_process_http_connection ()
#16 0x08071b65 in ap_run_process_connection ()
#17 0x08064b7c in process_socket ()
#18 0x0806532d in worker_thread ()
#19 0x4025cb06 in dummy_worker () from /usr/lib/libapr-0.so.0
#20 0x402d3941 in pthread_start_thread () from /lib/i686/libpthread.so.0

Thread 3 (Thread 131089 (LWP 24114)):
#0  0x420d3b2e in select () from /lib/i686/libc.so.6
#1  0x402254a0 in __JCR_LIST__ () from /lib/libdb-4.1.so
#2  0x401fb191 in __os_yield () from /lib/libdb-4.1.so
#3  0x401773bf in __db_tas_mutex_lock () from /lib/libdb-4.1.so
#4  0x401f8d6f in __memp_sync_files () from /lib/libdb-4.1.so
#5  0x401f86ec in __memp_sync_int () from /lib/libdb-4.1.so
#6  0x401f8182 in __memp_sync () from /lib/libdb-4.1.so
#7  0x4020f3af in __txn_checkpoint () from /lib/libdb-4.1.so
#8  0x4040f052 in cleanup_fs () from 
/root/rpms/build/subversion/subversion/libsvn_fs/.libs/libsvn_fs-1.so.0
#9  0x4040f108 in cleanup_fs_apr () from 
/root/rpms/build/subversion/subversion/libsvn_fs/.libs/libsvn_fs-1.so.0
#10 0x40261dfd in run_cleanups () from /usr/lib/libapr-0.so.0
#11 0x4026154d in apr_pool_destroy () from /usr/lib/libapr-0.so.0
#12 0x40261608 in apr_pool_destroy () from /usr/lib/libapr-0.so.0
#13 0x40261508 in apr_pool_clear () from /usr/lib/libapr-0.so.0
#14 0x0806534a in worker_thread ()
#15 0x4025cb06 in dummy_worker () from /usr/lib/libapr-0.so.0
#16 0x402d3941 in pthread_start_thread () from /lib/i686/libpthread.so.0

Thread 2 (Thread 16385 (LWP 24082)):
#0  0x420d224b in poll () from /lib/i686/libc.so.6
#1  0x402d2d9e in __pthread_manager () from /lib/i686/libpthread.so.0

Thread 1 (Thread 8192 (LWP 24080)):
#0  0x42028d69 in sigsuspend () from /lib/i686/libc.so.6
#1  0x402d5108 in __pthread_wait_for_restart_signal () from 
/lib/i686/libpthread.so.0
#2  0x402d2911 in pthread_join () from /lib/i686/libpthread.so.0
#3  0x4025ccbe in apr_thread_join () from /usr/lib/libapr-0.so.0
#4  0x08065903 in join_workers ()
#5  0x08065c75 in child_main ()
#6  0x08065ea3 in make_child ()
#7  0x08065f09 in startup_children ()
#8  0x0806693a in ap_mpm_run ()
#9  0x0806d6af in main ()
#10 0x420158d4 in __libc_start_main () from /lib/i686/libc.so.6



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: First httpd deadlock fixed, but found another one

Posted by Justin Erenkrantz <je...@apache.org>.
--On Tuesday, December 10, 2002 5:13 PM -0500 Brandon Ehle 
<az...@yahoo.com> wrote:

> Hrmm, are you sure?  I see it try to shutdown, then if that doesn't
> work it starts sending SIGTERMs to processes.  If the processes
> ignore the SIGTERM's, it will wait for them to finish which could
> be hours or never, unless I manually send a SIGKILL, which does the
> job, but the repository usually needs recovery at that point.

You are correct.  The worker MPM will wait infinitely for its 
children to come back.  If it never exits (due to an infinite loop in 
BDB), then that could be a problem.

Perhaps sending an email to the dev@httpd list asking about this 
would be beneficial.  Not sure how you would do this since the 
pthread_join call is usually blocking.  Perhaps doing a pthread_kill 
with SIGKILL beforehand might be helpful.  Anyway, dev@svn isn't the 
right forum for this part of the bug.  -- justin

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: First httpd deadlock fixed, but found another one

Posted by Brandon Ehle <az...@yahoo.com>.
> 
>
>Huh?  When I shutdown Apache it first sends SIGTERM to the other
>processes and if that is ignored it sends SIGKILL.  It's not possible
>for a process to ignore SIGKILL at all, let alone hang around for 4
>hours.  What on earth are you doing?
>  
>
Hrmm, are you sure?  I see it try to shutdown, then if that doesn't work 
it starts sending SIGTERMs to processes.  If the processes ignore the 
SIGTERM's, it will wait for them to finish which could be hours or 
never, unless I manually send a SIGKILL, which does the job, but the 
repository usually needs recovery at that point.

>>>Did you do as I suggest and check your Apache and system logs for
>>>killed proceses?  It's quite possible that 8333 files is pushing
>>>Apache over a memory limit.
>>>      
>>>
>>No killed processes.
>>    
>>
>
>I just deliberately provoked an OOM on my Linux machine.  The syslog
>contains
>
>Dec 10 21:02:38 debian2 kernel: Out of Memory: Killed process 28928 (httpd).
>
>but there is no message in the Apache log, so it's easy to miss.
>
I've gotten some of these previously, but not this time.  (This machine 
has 2GB of RAM so it takes a bit to OOM it).

>>>Isn't this thread still processing a request?  One that would have
>>>been generated by your test?  Or did you fail to mention some other
>>>process that accesses the repository?
>>>      
>>>
>>See above, apache might have needed more than 4 hours to shutdown.
>>Next time I will give it more time.
>>    
>>
>I still don't understand why you think shutting down Apache would
>cause a dav_method_put request.  As far as I can tell a dav_method_put
>occurs when you commit a new or modified file, and as such I would
>assume it is part of a failed commit in your test.
>  
>
I don't think shutting down caused this request, this is just what is 
failing to shut down.

>If the hang is repeatable then I suggest you check that the database
>is not locked before shutting down Apache, run an 'svn ls URL' for
>example.  Of course if your test is hanging the database then it
>appears it is not detecting failed commands.  In that case it might be
>better to have the test not shut down Apache, and then check manually
>whether the database has hung.
>
Its very repeatable, it just takes a few hours and alot of data to get 
Apache into this state.



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: First httpd deadlock fixed, but found another one

Posted by Philip Martin <ph...@codematters.co.uk>.
Brandon Ehle <az...@yahoo.com> writes:

> Probably, one thing I've noticed since I ran this test is that when
> you commit anything over 3000 files its starts to take a really long
> time to shutdown apache.  It looks like its finalizing something with
> the db before it shuts down.  For the 8333 file case, its possible
> that it had been shutting down for 4 hours and hadn't finished yet
> when I got in this morning.  Next time I see that stack trace I'll
> verify whether that is whats happening.

Huh?  When I shutdown Apache it first sends SIGTERM to the other
processes and if that is ignored it sends SIGKILL.  It's not possible
for a process to ignore SIGKILL at all, let alone hang around for 4
hours.  What on earth are you doing?

> >Did you do as I suggest and check your Apache and system logs for
> >killed proceses?  It's quite possible that 8333 files is pushing
> >Apache over a memory limit.
> 
> No killed processes.

I just deliberately provoked an OOM on my Linux machine.  The syslog
contains

Dec 10 21:02:38 debian2 kernel: Out of Memory: Killed process 28928 (httpd).

but there is no message in the Apache log, so it's easy to miss.

> >Does your test verify that the operations it runs do in fact succeed?
> 
> It just checks svn's return values.
> 
> >Isn't this thread still processing a request?  One that would have
> >been generated by your test?  Or did you fail to mention some other
> >process that accesses the repository?
> 
> See above, apache might have needed more than 4 hours to shutdown.
> Next time I will give it more time.

I still don't understand why you think shutting down Apache would
cause a dav_method_put request.  As far as I can tell a dav_method_put
occurs when you commit a new or modified file, and as such I would
assume it is part of a failed commit in your test.

If the hang is repeatable then I suggest you check that the database
is not locked before shutting down Apache, run an 'svn ls URL' for
example.  Of course if your test is hanging the database then it
appears it is not detecting failed commands.  In that case it might be
better to have the test not shut down Apache, and then check manually
whether the database has hung.

-- 
Philip Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: First httpd deadlock fixed, but found another one

Posted by Brandon Ehle <az...@yahoo.com>.
> 
>
>Do you really think Apache shutdown caused the problem?  Thread 4
>appears to be blocked processing a dav_method_put request, which
>doesn't sound like something that would be invoked by shutdown.  I
>think it is more likely a killed process caused the deadlock sometime
>earlier, and you only noticed it when the shutdown blocked.  It looks
>like the same problem you reported before, just observed at a
>different place.
>  
>
Probably, one thing I've noticed since I ran this test is that when you 
commit anything over 3000 files its starts to take a really long time to 
shutdown apache.  It looks like its finalizing something with the db 
before it shuts down.  For the 8333 file case, its possible that it had 
been shutting down for 4 hours and hadn't finished yet when I got in 
this morning.  Next time I see that stack trace I'll verify whether that 
is whats happening.

>Did you do as I suggest and check your Apache and system logs for
>killed proceses?  It's quite possible that 8333 files is pushing
>Apache over a memory limit.
>  
>
No killed processes.

>Does your test verify that the operations it runs do in fact succeed?
>  
>
It just checks svn's return values.

>Isn't this thread still processing a request?  One that would have
>been generated by your test?  Or did you fail to mention some other
>process that accesses the repository?
>  
>
See above, apache might have needed more than 4 hours to shutdown.  Next 
time I will give it more time.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: First httpd deadlock fixed, but found another one

Posted by Philip Martin <ph...@codematters.co.uk>.
Brandon Ehle <az...@yahoo.com> writes:

> Here is the backtrace of the threads in apache, note that this is the
> backtrace after some of the other threads have exited.

Do you really think Apache shutdown caused the problem?  Thread 4
appears to be blocked processing a dav_method_put request, which
doesn't sound like something that would be invoked by shutdown.  I
think it is more likely a killed process caused the deadlock sometime
earlier, and you only noticed it when the shutdown blocked.  It looks
like the same problem you reported before, just observed at a
different place.

Did you do as I suggest and check your Apache and system logs for
killed proceses?  It's quite possible that 8333 files is pushing
Apache over a memory limit.

Does your test verify that the operations it runs do in fact succeed?

If you want to get more useful stack traces build Subversion and
Apache with debug information, i.e. add -g to the CFLAGS.

> (gdb) thread apply all bt
> 
> Thread 4 (Thread 139282 (LWP 24116)):
> #0  0x401f4ae9 in __memp_fopen_int () from /lib/libdb-4.1.so
> #1  0x401f45b4 in __memp_fopen () from /lib/libdb-4.1.so
> #2  0x401a1101 in __db_dbenv_setup () from /lib/libdb-4.1.so
> #3  0x401b3bf6 in __db_dbopen () from /lib/libdb-4.1.so
> #4  0x401b391f in __db_open () from /lib/libdb-4.1.so
> #5  0x40408ce5 in svn_fs__open_nodes_table () from /root/rpms/build/subversion/subversion/libsvn_fs/.libs/libsvn_fs-1.so.0
> #6  0x4040f8b3 in svn_fs_open_berkeley () from /root/rpms/build/subversion/subversion/libsvn_fs/.libs/libsvn_fs-1.so.0
> #7  0x403fcb77 in get_repos () from /root/rpms/build/subversion/subversion/libsvn_repos/.libs/libsvn_repos-1.so.0
> #8  0x403fccd5 in svn_repos_open () from /root/rpms/build/subversion/subversion/libsvn_repos/.libs/libsvn_repos-1.so.0
> #9  0x403e6b9a in dav_svn_get_resource () from /etc/httpd/modules/mod_dav_svn.so
> #10 0x4031aa4a in dav_get_resource () from /etc/httpd/modules/mod_dav.so
> #11 0x4031aeaf in dav_method_put () from /etc/httpd/modules/mod_dav.so
> #12 0x080684d5 in ap_run_handler ()
> #13 0x08068aed in ap_invoke_handler ()
> #14 0x080636a6 in ap_process_request ()
> #15 0x0805ecac in ap_process_http_connection ()
> #16 0x08071b65 in ap_run_process_connection ()

Isn't this thread still processing a request?  One that would have
been generated by your test?  Or did you fail to mention some other
process that accesses the repository?

> #17 0x08064b7c in process_socket ()
> #18 0x0806532d in worker_thread ()
> #19 0x4025cb06 in dummy_worker () from /usr/lib/libapr-0.so.0
> #20 0x402d3941 in pthread_start_thread () from /lib/i686/libpthread.so.0

-- 
Philip Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org