You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Philip Martin <ph...@codematters.co.uk> on 2012/11/01 13:29:55 UTC

Random serf checkout failures

I've been experimenting checking out subversion/tags from my local
mirror of the Subversion repository using a client and mod_dav_svn from
trunk.  I'm getting several different errors seemingly at random, I
suspect they may be related to some common underlying problem.

Sometimes I see a checksum error such as this:

../src/subversion/svn/checkout-cmd.c:168: (apr_err=200014)
../src/subversion/libsvn_client/checkout.c:179: (apr_err=200014)
../src/subversion/libsvn_client/update.c:579: (apr_err=200014)
../src/subversion/libsvn_client/update.c:440: (apr_err=200014)
../src/subversion/libsvn_wc/adm_crawler.c:858: (apr_err=200014)
../src/subversion/libsvn_ra_serf/update.c:2515: (apr_err=200014)
../src/subversion/libsvn_ra_serf/util.c:2023: (apr_err=200014)
../src/subversion/libsvn_ra_serf/util.c:2004: (apr_err=200014)
../src/subversion/libsvn_ra_serf/update.c:1065: (apr_err=200014)
../src/subversion/libsvn_ra_serf/update.c:895: (apr_err=200014)
../src/subversion/libsvn_wc/update_editor.c:3919: (apr_err=200014)
../src/subversion/libsvn_subr/checksum.c:469: (apr_err=200014)
svn: E200014: Checksum mismatch for '/home/pm/sw/subversion/obj/wc/0.25.0/subversion/libsvn_fs/tree.c':
   expected:  c38c3e8df15f36a45b8a626e4d82a25d
     actual:  2532722e781ef9afe673c89fd467d5f0

The expected checksum is the checksum in the repository.  It's a
different file each time this error occurs.

Another error is an assert:

Program received signal SIGABRT, Aborted.
0x00007ffff6530425 in __GI_raise (sig=<optimised out>)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff6530425 in __GI_raise (sig=<optimised out>)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007ffff6533b8b in __GI_abort () at abort.c:91
#2  0x00007ffff6d41b78 in svn_error_abort_on_malfunction (can_return=1, 
    file=0x7ffff5a97058 "../src/subversion/libsvn_ra_serf/update.c", line=673, 
    expr=0x7ffff5a970fe "! dir->ref_count")
    at ../src/subversion/libsvn_subr/error.c:678
#3  0x00007ffff6d41bcc in svn_error__malfunction (can_return=1, 
    file=0x7ffff5a97058 "../src/subversion/libsvn_ra_serf/update.c", line=673, 
    expr=0x7ffff5a970fe "! dir->ref_count")
    at ../src/subversion/libsvn_subr/error.c:703
#4  0x00007ffff5a85d50 in close_all_dirs (dir=0x7fffe885f0c8)
    at ../src/subversion/libsvn_ra_serf/update.c:673
#5  0x00007ffff5a85cda in close_all_dirs (dir=0x7fffe88d00c8)
    at ../src/subversion/libsvn_ra_serf/update.c:669
#6  0x00007ffff5a85cda in close_all_dirs (dir=0x7fffe89860c8)
    at ../src/subversion/libsvn_ra_serf/update.c:669
#7  0x00007ffff5a85cda in close_all_dirs (dir=0x7ffff7f2a0c8)
    at ../src/subversion/libsvn_ra_serf/update.c:669
#8  0x00007ffff5a8b01b in finish_report (report_baton=0x7ffff7fb8060, 
    pool=0x7ffff7fb5028) at ../src/subversion/libsvn_ra_serf/update.c:2665
#9  0x00007ffff76945cc in svn_wc_crawl_revisions5 (wc_ctx=0x7ffff7f670e0, 
    local_abspath=0x7ffff7fb5168 "/home/pm/sw/subversion/obj/wc", 
    reporter=0x7ffff5ca07c0, report_baton=0x7ffff7fb8060, restore_files=1, 
    depth=svn_depth_unknown, honor_depth_exclude=1, 
    depth_compatibility_trick=0, use_commit_times=0, 
    cancel_func=0x416b1b <svn_cl__check_cancel>, cancel_baton=0x0, 
    notify_func=0x41c703 <notify>, notify_baton=0x7ffff7f673d0, 
    scratch_pool=0x7ffff7fb5028)
    at ../src/subversion/libsvn_wc/adm_crawler.c:858

This appears to be in the end of the checkout code but the working copy
is far from complete.

I have also seen this error:

../src/subversion/svn/checkout-cmd.c:168: (apr_err=175009)
../src/subversion/libsvn_client/checkout.c:179: (apr_err=175009)
../src/subversion/libsvn_client/update.c:579: (apr_err=175009)
../src/subversion/libsvn_client/update.c:440: (apr_err=175009)
../src/subversion/libsvn_wc/adm_crawler.c:858: (apr_err=175009)
../src/subversion/libsvn_ra_serf/update.c:2515: (apr_err=175009)
../src/subversion/libsvn_ra_serf/util.c:2028: (apr_err=175009)
../src/subversion/libsvn_ra_serf/util.c:1850: (apr_err=175009)
svn: E175009: Premature EOF seen from server (http status=200)
../src/subversion/libsvn_ra_serf/util.c:1850: (apr_err=70014)
svn: E070014: End of file found

All of these look like some problem with the data transfer, serf losing
data or mod_dav_svn not sending it, but I suppose they could also be
caused by memory corruption.  These errors generally occur when there
are thousands of files in the pristine store and tens of thousands of
files in the working copy.

I've seen these error while checking out across my LAN and on localhost
so I don't think it can be network problem.  I'm using serf 1.1.x@1674
and subversion/mod_dav_svn from trunk (currently r1404529 but I was
seeing these errors a couple of weeks ago as well).  I'm seeing these
errors on Debian/stable and Ubuntu/12.04.

-- 
Philip

Re: Random serf checkout failures

Posted by Philip Martin <ph...@wandisco.com>.

Lieven Govaerts <sv...@mobsol.be> writes:

> Are you using https or http?

http

> I've done similar tests with success recently, but that was on
> Windows. This was to the European mirror of the apache repository
> though, I don't have a local mirror available.

I suppose you could create a data set by importingd trunk a hundred
times modifying about 200 files each time.

> Are you willing to use serf trunk instead of 1.1.x? I've added logging
> to serf trunk, which has helped me a lot in analyzing serf behavior.
> The relevant flags are in serf_private.h. I'd set SOCK_VERBOSE,
> CONN_VERBOSE, SSL_VERBOSE and one of SSL_MSG_VERBOSE or
> SOCK_MSG_VERBOSE. All output is written to stderr so easy to redirect
> to file.

I can reproduce with trunk.  Here's the end of the log for a checksum
failure.  It appears to simply stop receiving data in the middle of a
file, hence the checksum failure.  The 70014 error is APR_EOF.

HTTP/1.1 200 OK
Date: Fri, 02 Nov 2012 21:56:55 GMT
Server: Apache/2.2.22 (Ubuntu) mod_ssl/2.2.22 OpenSSL/1.0.1 DAV/2 SVN/1.8.0-dev
Last-Modified: Tue, 24 Jul 2001 18:49:22 GMT
ETag: "839773//subversion/tags/0.9.0/subversion/libsvn_repos/hooks.txt"
Cache-Control: max-age=604800
Accept-Ranges: bytes
Content-Length: 9192
Content-Type: text/plain

                           Repository Hooks
                           ================

    ... text elided ...


   The point of sentinels is to provide real-time feedback as a commit
   is being built (or even before the txn is started), or as a
   checkout or update is being produced -- but without the overhead of
   st
-(5268)-
[2012-11-02T21:57:11.198912+00] [l:0.0.0.0:0 r:192.168.0.102:8888] buckets/socket_buckets.c: socket_recv error 70014
../src/subversion/svn/checkout-cmd.c:168: (apr_err=200014)
../src/subversion/libsvn_client/checkout.c:179: (apr_err=200014)
../src/subversion/libsvn_client/update.c:579: (apr_err=200014)
../src/subversion/libsvn_client/update.c:440: (apr_err=200014)
../src/subversion/libsvn_wc/adm_crawler.c:858: (apr_err=200014)
../src/subversion/libsvn_ra_serf/update.c:2515: (apr_err=200014)
../src/subversion/libsvn_ra_serf/util.c:2028: (apr_err=200014)
../src/subversion/libsvn_ra_serf/util.c:2009: (apr_err=200014)
../src/subversion/libsvn_ra_serf/update.c:1065: (apr_err=200014)
../src/subversion/libsvn_ra_serf/update.c:895: (apr_err=200014)
../src/subversion/libsvn_wc/update_editor.c:3921: (apr_err=200014)
../src/subversion/libsvn_subr/checksum.c:469: (apr_err=200014)
svn: E200014: Checksum mismatch for '/home/pm/sw/subversion/obj/wc/0.9.0/subversion/libsvn_repos/hooks.txt':
   expected:  f43fbb1f0c8a6ab22e12a540cbda4aca
     actual:  835e68f9025e7f567d68be840bef249c

[2012-11-02T21:57:16.321729+00] [l:0.0.0.0:0 r:192.168.0.102:8888] outgoing.c: cleanup - closed socket, status 9
[2012-11-02T21:57:16.321796+00] [l:0.0.0.0:0 r:192.168.0.102:8888] outgoing.c: cleanup - closed socket, status 9
[2012-11-02T21:57:16.321840+00] [l:0.0.0.0:0 r:192.168.0.102:8888] outgoing.c: cleanup - closed socket, status 9
[2012-11-02T21:57:16.321873+00] outgoing.c: cleaning up connection 0x25e8148
[2012-11-02T21:57:16.322017+00] outgoing.c: closed connection 0x25e8148
[2012-11-02T21:57:16.322044+00] outgoing.c: cleaning up connection 0x2526b58
[2012-11-02T21:57:16.322057+00] outgoing.c: closed connection 0x2526b58
[2012-11-02T21:57:16.322087+00] outgoing.c: cleaning up connection 0x247a768
[2012-11-02T21:57:16.322104+00] outgoing.c: closed connection 0x247a768
[2012-11-02T21:57:16.322125+00] outgoing.c: cleaning up connection 0x23eb838
[2012-11-02T21:57:16.322136+00] outgoing.c: closed connection 0x23eb838

Here's another failure.  I was using the default Apache timeouts which
should be 300s and this whole checkout was less than that (from 22:02:54
to 22:04:03).  I've upped the timeout to 3000s and it still fails:

[2012-11-02T22:04:01.384267+00] [l:0.0.0.0:0 r:192.168.0.102:8888] buckets/socket_buckets.c: --- socket_recv:
HTTP/1.1 408 Request Time-out
Date: Fri, 02 Nov 2012 22:03:57 GMT
Server: Apache/2.2.22 (Ubuntu) mod_ssl/2.2.22 OpenSSL/1.0.1 DAV/2 SVN/1.8.0-dev
Content-Length: 223
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>408 Request Time-out</title>
</head><body>
<h1>Request Time-out</h1>
<p>Server timeout waiting for the HTTP request from the client.</p>
</body></html>

-(459)-
../src/subversion/svn/checkout-cmd.c:168: (apr_err=20014)
../src/subversion/libsvn_client/checkout.c:179: (apr_err=20014)
../src/subversion/libsvn_client/update.c:579: (apr_err=20014)
../src/subversion/libsvn_client/update.c:440: (apr_err=20014)
../src/subversion/libsvn_wc/adm_crawler.c:858: (apr_err=20014)
../src/subversion/libsvn_ra_serf/update.c:2518: (apr_err=20014)
svn: E020014: Error retrieving REPORT: Internal error
[2012-11-02T22:04:03.023422+00] [l:0.0.0.0:0 r:192.168.0.102:8888] outgoing.c: cleanup - closed socket, status 9
[2012-11-02T22:04:03.023532+00] [l:0.0.0.0:0 r:192.168.0.102:8888] outgoing.c: cleanup - closed socket, status 9
[2012-11-02T22:04:03.023655+00] [l:0.0.0.0:0 r:192.168.0.102:8888] outgoing.c: cleanup - closed socket, status 9
[2012-11-02T22:04:03.023727+00] [l:0.0.0.0:0 r:192.168.0.102:8888] outgoing.c: cleanup - closed socket, status 9
[2012-11-02T22:04:03.023782+00] outgoing.c: cleaning up connection 0x2237288
[2012-11-02T22:04:03.023835+00] outgoing.c: closed connection 0x2237288
[2012-11-02T22:04:03.023860+00] outgoing.c: cleaning up connection 0x2199db8
[2012-11-02T22:04:03.023881+00] outgoing.c: closed connection 0x2199db8
[2012-11-02T22:04:03.023906+00] outgoing.c: cleaning up connection 0x20ecb28
[2012-11-02T22:04:03.023965+00] outgoing.c: closed connection 0x20ecb28
[2012-11-02T22:04:03.024011+00] outgoing.c: cleaning up connection 0x2084838
[2012-11-02T22:04:03.024032+00] outgoing.c: closed connection 0x2084838


-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: Random serf checkout failures

Posted by Lieven Govaerts <sv...@mobsol.be>.

Philip,

On Thu, Nov 1, 2012 at 1:29 PM, Philip Martin <ph...@codematters.co.uk> wrote:
> I've been experimenting checking out subversion/tags from my local
> mirror of the Subversion repository using a client and mod_dav_svn from
> trunk.  I'm getting several different errors seemingly at random, I
> suspect they may be related to some common underlying problem.
>
> Sometimes I see a checksum error such as this:
>
> ../src/subversion/svn/checkout-cmd.c:168: (apr_err=200014)
> ../src/subversion/libsvn_client/checkout.c:179: (apr_err=200014)
> ../src/subversion/libsvn_client/update.c:579: (apr_err=200014)
> ../src/subversion/libsvn_client/update.c:440: (apr_err=200014)
> ../src/subversion/libsvn_wc/adm_crawler.c:858: (apr_err=200014)
> ../src/subversion/libsvn_ra_serf/update.c:2515: (apr_err=200014)
> ../src/subversion/libsvn_ra_serf/util.c:2023: (apr_err=200014)
> ../src/subversion/libsvn_ra_serf/util.c:2004: (apr_err=200014)
> ../src/subversion/libsvn_ra_serf/update.c:1065: (apr_err=200014)
> ../src/subversion/libsvn_ra_serf/update.c:895: (apr_err=200014)
> ../src/subversion/libsvn_wc/update_editor.c:3919: (apr_err=200014)
> ../src/subversion/libsvn_subr/checksum.c:469: (apr_err=200014)
> svn: E200014: Checksum mismatch for '/home/pm/sw/subversion/obj/wc/0.25.0/subversion/libsvn_fs/tree.c':
>    expected:  c38c3e8df15f36a45b8a626e4d82a25d
>      actual:  2532722e781ef9afe673c89fd467d5f0
>
> The expected checksum is the checksum in the repository.  It's a
> different file each time this error occurs.
>
> Another error is an assert:
>
> Program received signal SIGABRT, Aborted.
> 0x00007ffff6530425 in __GI_raise (sig=<optimised out>)
>     at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> 64      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0  0x00007ffff6530425 in __GI_raise (sig=<optimised out>)
>     at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> #1  0x00007ffff6533b8b in __GI_abort () at abort.c:91
> #2  0x00007ffff6d41b78 in svn_error_abort_on_malfunction (can_return=1,
>     file=0x7ffff5a97058 "../src/subversion/libsvn_ra_serf/update.c", line=673,
>     expr=0x7ffff5a970fe "! dir->ref_count")
>     at ../src/subversion/libsvn_subr/error.c:678
> #3  0x00007ffff6d41bcc in svn_error__malfunction (can_return=1,
>     file=0x7ffff5a97058 "../src/subversion/libsvn_ra_serf/update.c", line=673,
>     expr=0x7ffff5a970fe "! dir->ref_count")
>     at ../src/subversion/libsvn_subr/error.c:703
> #4  0x00007ffff5a85d50 in close_all_dirs (dir=0x7fffe885f0c8)
>     at ../src/subversion/libsvn_ra_serf/update.c:673
> #5  0x00007ffff5a85cda in close_all_dirs (dir=0x7fffe88d00c8)
>     at ../src/subversion/libsvn_ra_serf/update.c:669
> #6  0x00007ffff5a85cda in close_all_dirs (dir=0x7fffe89860c8)
>     at ../src/subversion/libsvn_ra_serf/update.c:669
> #7  0x00007ffff5a85cda in close_all_dirs (dir=0x7ffff7f2a0c8)
>     at ../src/subversion/libsvn_ra_serf/update.c:669
> #8  0x00007ffff5a8b01b in finish_report (report_baton=0x7ffff7fb8060,
>     pool=0x7ffff7fb5028) at ../src/subversion/libsvn_ra_serf/update.c:2665
> #9  0x00007ffff76945cc in svn_wc_crawl_revisions5 (wc_ctx=0x7ffff7f670e0,
>     local_abspath=0x7ffff7fb5168 "/home/pm/sw/subversion/obj/wc",
>     reporter=0x7ffff5ca07c0, report_baton=0x7ffff7fb8060, restore_files=1,
>     depth=svn_depth_unknown, honor_depth_exclude=1,
>     depth_compatibility_trick=0, use_commit_times=0,
>     cancel_func=0x416b1b <svn_cl__check_cancel>, cancel_baton=0x0,
>     notify_func=0x41c703 <notify>, notify_baton=0x7ffff7f673d0,
>     scratch_pool=0x7ffff7fb5028)
>     at ../src/subversion/libsvn_wc/adm_crawler.c:858
>
> This appears to be in the end of the checkout code but the working copy
> is far from complete.
>
> I have also seen this error:
>
> ../src/subversion/svn/checkout-cmd.c:168: (apr_err=175009)
> ../src/subversion/libsvn_client/checkout.c:179: (apr_err=175009)
> ../src/subversion/libsvn_client/update.c:579: (apr_err=175009)
> ../src/subversion/libsvn_client/update.c:440: (apr_err=175009)
> ../src/subversion/libsvn_wc/adm_crawler.c:858: (apr_err=175009)
> ../src/subversion/libsvn_ra_serf/update.c:2515: (apr_err=175009)
> ../src/subversion/libsvn_ra_serf/util.c:2028: (apr_err=175009)
> ../src/subversion/libsvn_ra_serf/util.c:1850: (apr_err=175009)
> svn: E175009: Premature EOF seen from server (http status=200)
> ../src/subversion/libsvn_ra_serf/util.c:1850: (apr_err=70014)
> svn: E070014: End of file found
>
> All of these look like some problem with the data transfer, serf losing
> data or mod_dav_svn not sending it, but I suppose they could also be
> caused by memory corruption.  These errors generally occur when there
> are thousands of files in the pristine store and tens of thousands of
> files in the working copy.
>
> I've seen these error while checking out across my LAN and on localhost
> so I don't think it can be network problem.  I'm using serf 1.1.x@1674
> and subversion/mod_dav_svn from trunk (currently r1404529 but I was
> seeing these errors a couple of weeks ago as well).  I'm seeing these
> errors on Debian/stable and Ubuntu/12.04.

Are you using https or http?

I've done similar tests with success recently, but that was on
Windows. This was to the European mirror of the apache repository
though, I don't have a local mirror available.

Are you willing to use serf trunk instead of 1.1.x? I've added logging
to serf trunk, which has helped me a lot in analyzing serf behavior.
The relevant flags are in serf_private.h. I'd set SOCK_VERBOSE,
CONN_VERBOSE, SSL_VERBOSE and one of SSL_MSG_VERBOSE or
SOCK_MSG_VERBOSE. All output is written to stderr so easy to redirect
to file.


>
> --
> Philip

Lieven

AW: Random serf checkout failures

Posted by Markus Schaber <m....@codesys.com>.

Hi, Lieven,

Von: lieven.govaerts@gmail.com [mailto:lieven.govaerts@gmail.com] Im
> On Thu, Nov 8, 2012 at 12:02 PM, Markus Schaber <m....@codesys.com>
> wrote:
> > Von: lieven.govaerts@gmail.com
> > [....]
> >> 5. In one of Philip error-reporting mails, there was mention of a
> >> "408 Request Time-out" response.
> >> (http://svn.haxx.se/dev/archive-2012-11/0076.shtml).
> >> No idea if the server sends this response because one of the above
> >> issues, or if this has a different root caus. What I do know is that
> >> serf does not expect this response as it's not related to an
> >> outstanding request. In that case, serf will return and APR_GENERAL error.
> >>
> >> I think 3+4+5 are relatively easy to solve, but probably difficult to
> >> reproduce to validate the implemented solutions.
> >> 1a and 2 require more analysis, I'll try to focus on these two.
> >
> > I think 5 might be testable with a special version of mod_dontdothat
> which injects that error code, maybe conditionally or randomly.
> 
> I've found a way to trigger the issue with some modifications to serf, see
> the new email thread:
> http://svn.haxx.se/dev/archive-2012-11/0238.shtml

I'm happy to hear that.

Another possibility we could keep in mind for future incidents might be a special http proxy for fault injection.
http://spareclockcycles.org/2010/06/10/sergio-proxy-released/
http://extradata.com/products/FaultFactory/
https://www.owasp.org/index.php/Category%3aOWASP_WebScarab_Project



Best regards

Markus Schaber

CODESYS(r) a trademark of 3S-Smart Software Solutions GmbH

Inspiring Automation Solutions

3S-Smart Software Solutions GmbH
Dipl.-Inf. Markus Schaber | Product Development Core Technology
Memminger Str. 151 | 87439 Kempten | Germany
Tel. +49-831-54031-979 | Fax +49-831-54031-50

E-Mail: m.schaber@codesys.com | Web: http://www.codesys.com
CODESYS internet forum: http://forum.codesys.com

Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade register: Kempten HRB 6186 | Tax ID No.: DE 167014915

Re: Random serf checkout failures

Posted by Lieven Govaerts <lg...@apache.org>.

Hi Markus,

On Thu, Nov 8, 2012 at 12:02 PM, Markus Schaber <m....@codesys.com> wrote:
> Hi, Lieven,
>
> Von: lieven.govaerts@gmail.com
> [....]
>> 5. In one of Philip error-reporting mails, there was mention of a "408
>> Request Time-out" response.
>> (http://svn.haxx.se/dev/archive-2012-11/0076.shtml).
>> No idea if the server sends this response because one of the above issues,
>> or if this has a different root caus. What I do know is that serf does not
>> expect this response as it's not related to an outstanding request. In
>> that case, serf will return and APR_GENERAL error.
>>
>> I think 3+4+5 are relatively easy to solve, but probably difficult to
>> reproduce to validate the implemented solutions.
>> 1a and 2 require more analysis, I'll try to focus on these two.
>
> I think 5 might be testable with a special version of mod_dontdothat which injects that error code, maybe conditionally or randomly.
>

I've found a way to trigger the issue with some modifications to serf,
see the new email thread:
http://svn.haxx.se/dev/archive-2012-11/0238.shtml


> Best regards
>
> Markus Schaber
>

Lieven

AW: Random serf checkout failures

Posted by Markus Schaber <m....@codesys.com>.

Hi, Lieven,

Von: lieven.govaerts@gmail.com 
[....]
> 5. In one of Philip error-reporting mails, there was mention of a "408
> Request Time-out" response.
> (http://svn.haxx.se/dev/archive-2012-11/0076.shtml).
> No idea if the server sends this response because one of the above issues,
> or if this has a different root caus. What I do know is that serf does not
> expect this response as it's not related to an outstanding request. In
> that case, serf will return and APR_GENERAL error.
> 
> I think 3+4+5 are relatively easy to solve, but probably difficult to
> reproduce to validate the implemented solutions.
> 1a and 2 require more analysis, I'll try to focus on these two.

I think 5 might be testable with a special version of mod_dontdothat which injects that error code, maybe conditionally or randomly.

Best regards

Markus Schaber

CODESYS(r) a trademark of 3S-Smart Software Solutions GmbH

Inspiring Automation Solutions

3S-Smart Software Solutions GmbH
Dipl.-Inf. Markus Schaber | Product Development Core Technology
Memminger Str. 151 | 87439 Kempten | Germany
Tel. +49-831-54031-979 | Fax +49-831-54031-50

E-Mail: m.schaber@codesys.com | Web: http://www.codesys.com
CODESYS internet forum: http://forum.codesys.com

Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade register: Kempten HRB 6186 | Tax ID No.: DE 167014915

Re: Random serf checkout failures

Posted by Lieven Govaerts <lg...@apache.org>.

Hi,

we seem to be raising and discussing a few different issues in this
thread. I've tried to summarize current status with some added
observations.
I propose to start up separate mail threads when discussing one of
these specific issues.

1. When a very low Timeout is set in the apache server configuration,
the server truncates response bodies in some situations:
   1a. I see this happening with responses to GET requests, but it's
not clear yet why this happens, and if we need to worry about it, as a
3 second timeout where the default is 300 secs is really low. Philip
has reported this happening with a REPORT response
(http://svn.haxx.se/dev/archive-2012-11/0097.shtml).
   1b. If the server aborts a connection, ra_serf doesn't interpret
this as an error. This results in many different types of error
messages, like invalid checksums etc. A check has been added to serf
trunk r1678 that will raise an error if the response body is shorter
than expected.

2. ra_serf consumes way too much memory while doing large checkout's
(issue #4196). This is caused (at least partially) by ra_serf not
closing empty directories until the very end of the editor run,
thereby keeping info objects in memory. Mike & Justin have tried to
fix this in r1406366.
In my opinion, the '! dir->ref_count in close_all_dirs' asserts that
Philip keeps seeing (but I can't reproduce) are related to this
specific issue, and have not been fixed by r1406366.

3. Ivan added a potential issue because ra_serf+serf are not handling
events on each open connection regularly. This can result in timeouts,
where the server closes a connection because it thinks the client
isn't responding anymore.
If this happens on one of the connections where the GET/PROPFIND
requests are sent, I don't think this will break the update/checkout:
the server will close the connection and serf will resend the requests
on a new connection. However, if such a timeout occurs on the
connection where ra_serf is reading a large REPORT response, the
response will be truncated and svn will stop with an error.
Two observations:
   3a. When processing connections, serf will read as much data from
one connection as the network allows it to. Only when it can't read
more from the connection, it will go on with the next connection. Ivan
has provided a patch to limit the amount of responses read per
serf_context_run call per connection to 4.
(http://svn.haxx.se/dev/archive-2012-11/0193.shtml)
   3b. The same observation as 3a, but then applied to the connection
where we receive the REPORT response: ra_serf will keep reading as
much data as possible, and create new GET/PROPFIND requests as a
result, without giving serf the time to send these new requests on
other connections.
I don't think this can result in svn stopping, but might result in
stalls as we don't get enough requests out to fetch files.

4. Ivan further observed that the spillbuf mechanism, used to keep
ra_serf reading from the REPORT connection while waiting for slow disc
i/o, has two drawbacks:
    4a. the amount of data stored in the spillbuf can get very large
when processing the REPORT response is paused. When resuming, all data
in the spillbuf is read and processed in one go. As this can be 1MB+
it can take a while, potentially timing out the REPORT response
connection -> svn will stop.
    4b. The GET and PROPFIND requests resulting from parsing that big
chunk of data will all be sent on the same connection, where we have 4
available to send more requests in parallel.

5. In one of Philip error-reporting mails, there was mention of a "408
Request Time-out" response.
(http://svn.haxx.se/dev/archive-2012-11/0076.shtml).
No idea if the server sends this response because one of the above
issues, or if this has a different root caus. What I do know is that
serf does not expect this response as it's not related to an
outstanding request. In that case, serf will return and APR_GENERAL
error.

I think 3+4+5 are relatively easy to solve, but probably difficult to
reproduce to validate the implemented solutions.
1a and 2 require more analysis, I'll try to focus on these two.

Lieven

On Thu, Nov 1, 2012 at 1:29 PM, Philip Martin <ph...@codematters.co.uk> wrote:
> I've been experimenting checking out subversion/tags from my local
> mirror of the Subversion repository using a client and mod_dav_svn from
> trunk.  I'm getting several different errors seemingly at random, I
> suspect they may be related to some common underlying problem.
>
> Sometimes I see a checksum error such as this:
>
> ../src/subversion/svn/checkout-cmd.c:168: (apr_err=200014)
> ../src/subversion/libsvn_client/checkout.c:179: (apr_err=200014)
> ../src/subversion/libsvn_client/update.c:579: (apr_err=200014)
> ../src/subversion/libsvn_client/update.c:440: (apr_err=200014)
> ../src/subversion/libsvn_wc/adm_crawler.c:858: (apr_err=200014)
> ../src/subversion/libsvn_ra_serf/update.c:2515: (apr_err=200014)
> ../src/subversion/libsvn_ra_serf/util.c:2023: (apr_err=200014)
> ../src/subversion/libsvn_ra_serf/util.c:2004: (apr_err=200014)
> ../src/subversion/libsvn_ra_serf/update.c:1065: (apr_err=200014)
> ../src/subversion/libsvn_ra_serf/update.c:895: (apr_err=200014)
> ../src/subversion/libsvn_wc/update_editor.c:3919: (apr_err=200014)
> ../src/subversion/libsvn_subr/checksum.c:469: (apr_err=200014)
> svn: E200014: Checksum mismatch for '/home/pm/sw/subversion/obj/wc/0.25.0/subversion/libsvn_fs/tree.c':
>    expected:  c38c3e8df15f36a45b8a626e4d82a25d
>      actual:  2532722e781ef9afe673c89fd467d5f0
>
> The expected checksum is the checksum in the repository.  It's a
> different file each time this error occurs.
>
> Another error is an assert:
>
> Program received signal SIGABRT, Aborted.
> 0x00007ffff6530425 in __GI_raise (sig=<optimised out>)
>     at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> 64      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0  0x00007ffff6530425 in __GI_raise (sig=<optimised out>)
>     at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> #1  0x00007ffff6533b8b in __GI_abort () at abort.c:91
> #2  0x00007ffff6d41b78 in svn_error_abort_on_malfunction (can_return=1,
>     file=0x7ffff5a97058 "../src/subversion/libsvn_ra_serf/update.c", line=673,
>     expr=0x7ffff5a970fe "! dir->ref_count")
>     at ../src/subversion/libsvn_subr/error.c:678
> #3  0x00007ffff6d41bcc in svn_error__malfunction (can_return=1,
>     file=0x7ffff5a97058 "../src/subversion/libsvn_ra_serf/update.c", line=673,
>     expr=0x7ffff5a970fe "! dir->ref_count")
>     at ../src/subversion/libsvn_subr/error.c:703
> #4  0x00007ffff5a85d50 in close_all_dirs (dir=0x7fffe885f0c8)
>     at ../src/subversion/libsvn_ra_serf/update.c:673
> #5  0x00007ffff5a85cda in close_all_dirs (dir=0x7fffe88d00c8)
>     at ../src/subversion/libsvn_ra_serf/update.c:669
> #6  0x00007ffff5a85cda in close_all_dirs (dir=0x7fffe89860c8)
>     at ../src/subversion/libsvn_ra_serf/update.c:669
> #7  0x00007ffff5a85cda in close_all_dirs (dir=0x7ffff7f2a0c8)
>     at ../src/subversion/libsvn_ra_serf/update.c:669
> #8  0x00007ffff5a8b01b in finish_report (report_baton=0x7ffff7fb8060,
>     pool=0x7ffff7fb5028) at ../src/subversion/libsvn_ra_serf/update.c:2665
> #9  0x00007ffff76945cc in svn_wc_crawl_revisions5 (wc_ctx=0x7ffff7f670e0,
>     local_abspath=0x7ffff7fb5168 "/home/pm/sw/subversion/obj/wc",
>     reporter=0x7ffff5ca07c0, report_baton=0x7ffff7fb8060, restore_files=1,
>     depth=svn_depth_unknown, honor_depth_exclude=1,
>     depth_compatibility_trick=0, use_commit_times=0,
>     cancel_func=0x416b1b <svn_cl__check_cancel>, cancel_baton=0x0,
>     notify_func=0x41c703 <notify>, notify_baton=0x7ffff7f673d0,
>     scratch_pool=0x7ffff7fb5028)
>     at ../src/subversion/libsvn_wc/adm_crawler.c:858
>
> This appears to be in the end of the checkout code but the working copy
> is far from complete.
>
> I have also seen this error:
>
> ../src/subversion/svn/checkout-cmd.c:168: (apr_err=175009)
> ../src/subversion/libsvn_client/checkout.c:179: (apr_err=175009)
> ../src/subversion/libsvn_client/update.c:579: (apr_err=175009)
> ../src/subversion/libsvn_client/update.c:440: (apr_err=175009)
> ../src/subversion/libsvn_wc/adm_crawler.c:858: (apr_err=175009)
> ../src/subversion/libsvn_ra_serf/update.c:2515: (apr_err=175009)
> ../src/subversion/libsvn_ra_serf/util.c:2028: (apr_err=175009)
> ../src/subversion/libsvn_ra_serf/util.c:1850: (apr_err=175009)
> svn: E175009: Premature EOF seen from server (http status=200)
> ../src/subversion/libsvn_ra_serf/util.c:1850: (apr_err=70014)
> svn: E070014: End of file found
>
> All of these look like some problem with the data transfer, serf losing
> data or mod_dav_svn not sending it, but I suppose they could also be
> caused by memory corruption.  These errors generally occur when there
> are thousands of files in the pristine store and tens of thousands of
> files in the working copy.
>
> I've seen these error while checking out across my LAN and on localhost
> so I don't think it can be network problem.  I'm using serf 1.1.x@1674
> and subversion/mod_dav_svn from trunk (currently r1404529 but I was
> seeing these errors a couple of weeks ago as well).  I'm seeing these
> errors on Debian/stable and Ubuntu/12.04.
>
> --
> Philip

Re: Random serf checkout failures

Posted by Philip Martin <ph...@wandisco.com>.

Lieven Govaerts <lg...@mobsol.be> writes:

> My current test with svn trunk & apache+mod_dav_svn 1.6.12 does not
> reproduce this issue.

I'm seeing the problem with trunk mod_dav_svn.

> What I am seeing is that svn is using a lot of memory, up to 1GB.
> Currently looking at the root cause there.

http://subversion.tigris.org/issues/show_bug.cgi?id=4194

-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: Random serf checkout failures

Posted by "C. Michael Pilato" <cm...@collab.net>.

On 11/05/2012 04:21 PM, Lieven Govaerts wrote:
> Do you plan on patching this issue? If yes I can continue looking at
> Philip's original issue in this thread - assuming it is not caused by
> the memory leak.

Thanks for the helpful cross-checking!  Yes, I plan to continue looking into
this issue.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Enterprise Cloud Development

Re: Random serf checkout failures

Posted by Lieven Govaerts <lg...@mobsol.be>.

Hi Mike

On Mon, Nov 5, 2012 at 9:39 PM, C. Michael Pilato <cm...@collab.net> wrote:
> On 11/05/2012 01:33 PM, C. Michael Pilato wrote:
>> On 11/05/2012 11:49 AM, Lieven Govaerts wrote:
>>> My current test with svn trunk & apache+mod_dav_svn 1.6.12 does not
>>> reproduce this issue.
>>>
>>> What I am seeing is that svn is using a lot of memory, up to 1GB.
>>> Currently looking at the root cause there.
>>>
>>> First clue: many directories are not closed until the call to
>>> close_all_dirs in libsvn_ra_serf/update.c:finish_report. This means
>>> that for the most part of the checkout, the associated info object is
>>> kept in memory.
>>
>> My debugging indicates that when close_all_dirs() is called, there are a
>> slew of unclosed directories which each have a non-zero ref_count.  Save for
>> the root directory of the edit (/tags), fetch_props == TRUE,
>> propfind_handler->done == TRUE, and tag_closed == TRUE for all open child
>> directories.  So I'm looking around at how file references are managed to
>> see if anything has leaked.  I might have inadvertently botched something
>> when adding the pull-contents-from-the-wc-cache logic.
>
> So ... if I'm reading the ra_serf code correctly (and let me tell you,
> there's a good chance I'm not), the only time we opportunistically close
> directory batons is when we're processing the current list of completed file
> fetches.  See the code which follows this comment:
>
>           /* If we have a valid directory and
>            * we have no open items in this dir and
>            * we've closed the directory tag (no more children can be added)
>            * and either:
>            *   we know we won't be fetching props or
>            *   we've already completed the propfind
>            * then, we know it's time for us to close this directory.
>            */
>
> That code is all encompassed within an outer loop that looks like so:
>
>       /* prune our fetches list if they are done. */
>       done_list = report->done_fetches;
>       while (done_list)
>         {
>
> And all that is nested within yet another loop that runs until the REPORT
> response is all parse and all auxiliary GETs and PROPFINDs are finished.
>
> So, back to the opportunistic directory closure:  *if* we've finished any
> GETs since the last time we checked, and if ${STUFF_IN_THAT_BIG_COMMENT},
> then we try to close the parent directory of the file we fetched, plus any
> other close-able parent directories thereof.
>
> But what about when there are no file content fetches to manage?
>
> Using a trunk client and 1.7.x server, I just tried to export a tree with
> 15k directories (10 wide at the top, 15 or so below that, and 100 or so
> below those) and no files.  153 Mb of memory peak usage, and not a single
> directory closure until the whole tree had been processed/PROPFINDed/etc.
>
> Unless I'm mistaken, we will delay until the very end the closure of every
> directory (and by refcount-management-extension, every parent thereof) for
> which we don't execute at least one GET for an immediately child thereof.

With Philip's test scenario, you can see this clearly as the 3 empty
directories in svn trunk are only closed at the very end:
..
A    /tmp/tags/52/subversion/bindings/swig/proxy
A    /tmp/tags/51/subversion/tests/cmdline/import_tests_data/import_tree/DIR1.noo
A    /tmp/tags/51/subversion/tests/cmdline/import_tests_data/import_tree/DIR6/DIR7/DIR8.noo
A    /tmp/tags/51/subversion/bindings/swig/proxy
A    /tmp/tags/50/subversion/tests/cmdline/import_tests_data/import_tree/DIR1.noo
A    /tmp/tags/50/subversion/tests/cmdline/import_tests_data/import_tree/DIR6/DIR7/DIR8.noo
A    /tmp/tags/50/subversion/bindings/swig/proxy
Checked out revision 151.

> I expect this situation is not new to trunk.  (In fact, repeating the above
> export with a 1.7 client, I see a peak memory using of 458 Mb!)  Memory
> usage improvements in trunk help matters.  But then, the feature which
> causes ra_serf to use local file contents where it could avoid hitting the
> network will, I expect, cause still more instances of directories for which
> no immediate file children's contents will be fetched.
>
> Philip, Lieven, somebody -- care to double-check my thinking here?  I'm only
> 80% confident in my findings at the moment.
>

I haven't checked the code for all details yet, but I can confirm from
looking at finish_report that directories are only closed when at
least all files in it have been processed, and empty directories
aren't even considered in the outer close_dir loop.
Your analysis confirms what I see with additional logging.

Do you plan on patching this issue? If yes I can continue looking at
Philip's original issue in this thread - assuming it is not caused by
the memory leak.

Lieven

Re: Random serf checkout failures

Posted by "C. Michael Pilato" <cm...@collab.net>.

On 11/05/2012 01:33 PM, C. Michael Pilato wrote:
> On 11/05/2012 11:49 AM, Lieven Govaerts wrote:
>> My current test with svn trunk & apache+mod_dav_svn 1.6.12 does not
>> reproduce this issue.
>>
>> What I am seeing is that svn is using a lot of memory, up to 1GB.
>> Currently looking at the root cause there.
>>
>> First clue: many directories are not closed until the call to
>> close_all_dirs in libsvn_ra_serf/update.c:finish_report. This means
>> that for the most part of the checkout, the associated info object is
>> kept in memory.
> 
> My debugging indicates that when close_all_dirs() is called, there are a
> slew of unclosed directories which each have a non-zero ref_count.  Save for
> the root directory of the edit (/tags), fetch_props == TRUE,
> propfind_handler->done == TRUE, and tag_closed == TRUE for all open child
> directories.  So I'm looking around at how file references are managed to
> see if anything has leaked.  I might have inadvertently botched something
> when adding the pull-contents-from-the-wc-cache logic.

So ... if I'm reading the ra_serf code correctly (and let me tell you,
there's a good chance I'm not), the only time we opportunistically close
directory batons is when we're processing the current list of completed file
fetches.  See the code which follows this comment:

          /* If we have a valid directory and
           * we have no open items in this dir and
           * we've closed the directory tag (no more children can be added)
           * and either:
           *   we know we won't be fetching props or
           *   we've already completed the propfind
           * then, we know it's time for us to close this directory.
           */

That code is all encompassed within an outer loop that looks like so:

      /* prune our fetches list if they are done. */
      done_list = report->done_fetches;
      while (done_list)
        {

And all that is nested within yet another loop that runs until the REPORT
response is all parse and all auxiliary GETs and PROPFINDs are finished.

So, back to the opportunistic directory closure:  *if* we've finished any
GETs since the last time we checked, and if ${STUFF_IN_THAT_BIG_COMMENT},
then we try to close the parent directory of the file we fetched, plus any
other close-able parent directories thereof.

But what about when there are no file content fetches to manage?

Using a trunk client and 1.7.x server, I just tried to export a tree with
15k directories (10 wide at the top, 15 or so below that, and 100 or so
below those) and no files.  153 Mb of memory peak usage, and not a single
directory closure until the whole tree had been processed/PROPFINDed/etc.

Unless I'm mistaken, we will delay until the very end the closure of every
directory (and by refcount-management-extension, every parent thereof) for
which we don't execute at least one GET for an immediately child thereof.  I
expect this situation is not new to trunk.  (In fact, repeating the above
export with a 1.7 client, I see a peak memory using of 458 Mb!)  Memory
usage improvements in trunk help matters.  But then, the feature which
causes ra_serf to use local file contents where it could avoid hitting the
network will, I expect, cause still more instances of directories for which
no immediate file children's contents will be fetched.

Philip, Lieven, somebody -- care to double-check my thinking here?  I'm only
80% confident in my findings at the moment.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Enterprise Cloud Development

Re: Random serf checkout failures

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.

On Tue, Nov 6, 2012 at 3:22 PM, C. Michael Pilato <cm...@collab.net> wrote:
> On 11/06/2012 10:15 AM, C. Michael Pilato wrote:
>> On 11/06/2012 08:29 AM, C. Michael Pilato wrote:
>>> I recall this patch of yours -- I even asked you about it post-Berlin.  I'll
>>> take a look at it now and see if it can help us out.
>>
>> Not seeing the active_dir_propfinds list corruption you mentioned, Justin.
>
> Bummer.  Now I am.

*grin*

Yah, the pool lifetimes are all...overly complicated.  The 'right' way
to solve it, IMO, is to redo the parsing logic akin to how Greg was
redoing the other ra_serf files.  -- justin

Re: Random serf checkout failures

Posted by "C. Michael Pilato" <cm...@collab.net>.

On 11/06/2012 10:15 AM, C. Michael Pilato wrote:
> On 11/06/2012 08:29 AM, C. Michael Pilato wrote:
>> I recall this patch of yours -- I even asked you about it post-Berlin.  I'll
>> take a look at it now and see if it can help us out.
> 
> Not seeing the active_dir_propfinds list corruption you mentioned, Justin.

Bummer.  Now I am.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Enterprise Cloud Development

Re: Random serf checkout failures

Posted by "C. Michael Pilato" <cm...@collab.net>.

On 11/06/2012 08:29 AM, C. Michael Pilato wrote:
> On 11/05/2012 08:30 PM, Justin Erenkrantz wrote:
>> On Mon, Nov 5, 2012 at 1:33 PM, C. Michael Pilato <cm...@collab.net> wrote:
>>> My debugging indicates that when close_all_dirs() is called, there are a
>>> slew of unclosed directories which each have a non-zero ref_count.  Save for
>>> the root directory of the edit (/tags), fetch_props == TRUE,
>>> propfind_handler->done == TRUE, and tag_closed == TRUE for all open child
>>> directories.  So I'm looking around at how file references are managed to
>>> see if anything has leaked.  I might have inadvertently botched something
>>> when adding the pull-contents-from-the-wc-cache logic.
>>
>> Yup - your analysis is pretty accurate IIRC.  I posted about this when
>> I was in Berlin and even had a few patches to start to clean this up.
>> However, Greg was planning on redoing how we parse the XML report;
>> when that happens, some of this management should be far simpler (as
>> we know when the tags are closed) - so the thinking was to wait until
>> that was done and then assess whether that was fixed or not.  --
>> justin
>>
> 
> I recall this patch of yours -- I even asked you about it post-Berlin.  I'll
> take a look at it now and see if it can help us out.

Not seeing the active_dir_propfinds list corruption you mentioned, Justin.
I removed all the allocator stuff, and simplified the patch to mimic what is
done for the file propfind situation.  One changed I did need to make was to
called open_dir() (which should be named ensure_open_dir() or somesuch)
before then trying to close directories because it seems like we were trying
to call set_dir_props() before the directory had ever been opened.

With your base patch and my changes, I was able to get an 'svn export' which
used to top out at 210Mb to peak at 27Mb instead!

I need to run the test suite and also verify that we are in fact driving the
update editor in, well, as legal a fashion as ra_serf can.  :-)

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Enterprise Cloud Development

Re: Random serf checkout failures

Posted by "C. Michael Pilato" <cm...@collab.net>.

On 11/05/2012 08:30 PM, Justin Erenkrantz wrote:
> On Mon, Nov 5, 2012 at 1:33 PM, C. Michael Pilato <cm...@collab.net> wrote:
>> My debugging indicates that when close_all_dirs() is called, there are a
>> slew of unclosed directories which each have a non-zero ref_count.  Save for
>> the root directory of the edit (/tags), fetch_props == TRUE,
>> propfind_handler->done == TRUE, and tag_closed == TRUE for all open child
>> directories.  So I'm looking around at how file references are managed to
>> see if anything has leaked.  I might have inadvertently botched something
>> when adding the pull-contents-from-the-wc-cache logic.
> 
> Yup - your analysis is pretty accurate IIRC.  I posted about this when
> I was in Berlin and even had a few patches to start to clean this up.
> However, Greg was planning on redoing how we parse the XML report;
> when that happens, some of this management should be far simpler (as
> we know when the tags are closed) - so the thinking was to wait until
> that was done and then assess whether that was fixed or not.  --
> justin
> 

I recall this patch of yours -- I even asked you about it post-Berlin.  I'll
take a look at it now and see if it can help us out.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Enterprise Cloud Development

Re: Random serf checkout failures

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.

On Mon, Nov 5, 2012 at 1:33 PM, C. Michael Pilato <cm...@collab.net> wrote:
> My debugging indicates that when close_all_dirs() is called, there are a
> slew of unclosed directories which each have a non-zero ref_count.  Save for
> the root directory of the edit (/tags), fetch_props == TRUE,
> propfind_handler->done == TRUE, and tag_closed == TRUE for all open child
> directories.  So I'm looking around at how file references are managed to
> see if anything has leaked.  I might have inadvertently botched something
> when adding the pull-contents-from-the-wc-cache logic.

Yup - your analysis is pretty accurate IIRC.  I posted about this when
I was in Berlin and even had a few patches to start to clean this up.
However, Greg was planning on redoing how we parse the XML report;
when that happens, some of this management should be far simpler (as
we know when the tags are closed) - so the thinking was to wait until
that was done and then assess whether that was fixed or not.  --
justin

Re: Random serf checkout failures

Posted by "C. Michael Pilato" <cm...@collab.net>.

On 11/05/2012 11:49 AM, Lieven Govaerts wrote:
> On Mon, Nov 5, 2012 at 12:41 PM, Lieven Govaerts <lg...@mobsol.be> wrote:
>> On Mon, Nov 5, 2012 at 11:33 AM, Philip Martin
>> <ph...@wandisco.com> wrote:
>>> Philip Martin <ph...@codematters.co.uk> writes:
>>>
>>>> I've been experimenting checking out subversion/tags from my local
>>>> mirror of the Subversion repository using a client and mod_dav_svn from
>>>> trunk.  I'm getting several different errors seemingly at random, I
>>>> suspect they may be related to some common underlying problem.
>>>
>>> I can reproduce this using another dataset that should easier for others
>>> to reproduce: import trunk 150 times making 150 random file changes each
>>> time.
>>>
>>> svn export http://svn.apache.org/repos/asf/subversion/trunk trunk
>>> svnadmin create repo
>>> svn mkdir -mm file://`pwd`/repo/tags
>>> for t in `seq 1 150` ; do
>>>   for f in `find trunk -type f | sort -R | head -150` ; do
>>>     echo xx$t >> $f
>>>   done
>>>   svn import -mm trunk file://`pwd`/repo/tags/$t
>>> done
>>>
>>
>> Alright thanks, I'll try this at the hackathon this afternoon.
>>
> 
> My current test with svn trunk & apache+mod_dav_svn 1.6.12 does not
> reproduce this issue.
> 
> What I am seeing is that svn is using a lot of memory, up to 1GB.
> Currently looking at the root cause there.
> 
> First clue: many directories are not closed until the call to
> close_all_dirs in libsvn_ra_serf/update.c:finish_report. This means
> that for the most part of the checkout, the associated info object is
> kept in memory.

My debugging indicates that when close_all_dirs() is called, there are a
slew of unclosed directories which each have a non-zero ref_count.  Save for
the root directory of the edit (/tags), fetch_props == TRUE,
propfind_handler->done == TRUE, and tag_closed == TRUE for all open child
directories.  So I'm looking around at how file references are managed to
see if anything has leaked.  I might have inadvertently botched something
when adding the pull-contents-from-the-wc-cache logic.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Enterprise Cloud Development

Re: Random serf checkout failures

Posted by Lieven Govaerts <lg...@mobsol.be>.

On Mon, Nov 5, 2012 at 12:41 PM, Lieven Govaerts <lg...@mobsol.be> wrote:
> On Mon, Nov 5, 2012 at 11:33 AM, Philip Martin
> <ph...@wandisco.com> wrote:
>> Philip Martin <ph...@codematters.co.uk> writes:
>>
>>> I've been experimenting checking out subversion/tags from my local
>>> mirror of the Subversion repository using a client and mod_dav_svn from
>>> trunk.  I'm getting several different errors seemingly at random, I
>>> suspect they may be related to some common underlying problem.
>>
>> I can reproduce this using another dataset that should easier for others
>> to reproduce: import trunk 150 times making 150 random file changes each
>> time.
>>
>> svn export http://svn.apache.org/repos/asf/subversion/trunk trunk
>> svnadmin create repo
>> svn mkdir -mm file://`pwd`/repo/tags
>> for t in `seq 1 150` ; do
>>   for f in `find trunk -type f | sort -R | head -150` ; do
>>     echo xx$t >> $f
>>   done
>>   svn import -mm trunk file://`pwd`/repo/tags/$t
>> done
>>
>
> Alright thanks, I'll try this at the hackathon this afternoon.
>

My current test with svn trunk & apache+mod_dav_svn 1.6.12 does not
reproduce this issue.

What I am seeing is that svn is using a lot of memory, up to 1GB.
Currently looking at the root cause there.

First clue: many directories are not closed until the call to
close_all_dirs in libsvn_ra_serf/update.c:finish_report. This means
that for the most part of the checkout, the associated info object is
kept in memory.

>> Another checkout error to add to list of ways it can fail:
>>
>> A    wc/86/subversion/tests/cmdline/import_tests_data/import_tree/DIR1.noo
>> A    wc/86/subversion/tests/cmdline/import_tests_data/import_tree/DIR6/DIR7/DIR8.noo
>> A    wc/86/subversion/bindings/swig/proxy
>> ../src/subversion/svn/checkout-cmd.c:168: (apr_err=175009)
>> ../src/subversion/libsvn_client/checkout.c:179: (apr_err=175009)
>> ../src/subversion/libsvn_client/update.c:579: (apr_err=175009)
>> ../src/subversion/libsvn_client/update.c:440: (apr_err=175009)
>> ../src/subversion/libsvn_wc/adm_crawler.c:858: (apr_err=175009)
>> ../src/subversion/libsvn_ra_serf/update.c:2676: (apr_err=175009)
>> ../src/subversion/libsvn_ra_serf/update.c:2672: (apr_err=175009)
>> svn: E175009: Missing update-report close tag
>>
>
> From your previous report I noticed the 406 response that serf isn't
> handling, at least serf doesn't support a response coming in while the
> matching request wasn't completely sent yet. I guess that 406 response
> was for a REPORT request not flushed fast enough.
>
Lieven

Re: Random serf checkout failures

Posted by Philip Martin <ph...@wandisco.com>.

Ivan Zhakov <iv...@visualsvn.com> writes:

> Could please test attached serf patch. This patch should fix problem
> with reading one connection for long time without reading from other
> connections.

I still see assertions with your patch and Lieven's patch together.

-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: Random serf checkout failures

Posted by Ivan Zhakov <iv...@visualsvn.com>.

On Wed, Nov 7, 2012 at 1:41 AM, Ivan Zhakov <iv...@visualsvn.com> wrote:
> On Tue, Nov 6, 2012 at 10:24 PM, Ivan Zhakov <iv...@visualsvn.com> wrote:
>> On Tue, Nov 6, 2012 at 9:13 PM, Lieven Govaerts <lg...@mobsol.be> wrote:
>>> Hi,
>>>
>>> On Tue, Nov 6, 2012 at 4:50 PM, Lieven Govaerts <lg...@mobsol.be> wrote:
>>>> Ben,
>>>>
>>>> On Tue, Nov 6, 2012 at 4:09 PM, Ben Reser <be...@reser.org> wrote:
>>>>> I worked with Philip today and was able to reproduce the exact problem
>>>>> he's been seeing.  I ended up having to get his full httpd.conf to
>>>>> figure it out..
>>>>>
>>>>> Ultimately the problem proved to be that he had this directive:
>>>>> Timeout 3
>>>>>
>>>>> Which would mean if we don't tend a connection for 3 seconds Apache
>>>>> will close it.  Serf should be able to deal with the connection being
>>>>> closed.
>>>>>
>>>
>>> okay, so with the Timeout directive added I can reproduce this issue.
>>>
>>> What I see is that the server closes the connection in the middle of
>>> sending a response to the client. It doesn't even finalize the
>>> response first.
>>> So ra_serf is reading the data from the response bucket, but gets an
>>> APR_EOF when it needs more data than specified in the Content-Length
>>> header of the response.
>>>
>>> What is the expected behavior here, let serf close down the connection
>>> and try the request again on a new connection?
>>>
>> I think no. Timeout 3 directive means "abort the connection if client
>> didn't read data within 3 seconds"
>>
>> So most likely reason the client is busy with doing something for a
>> long time without reading data from this network connection.
>>
>> Probably it's related to the issue I found today:
>> 1. We reading REPORT faster than complete PROPFINDs/GETs
>> 2. Data from REPORT response going to spillbuf: 1mb going to memory,
>> others stored to disk.
>> 3. After completing some PROPFINDs/GETs serf update editor resumes
>> parsing data from spillbuf: 1mb in one chunk
>>
>> Two problems comes here:
>> 1. When parsing such big block we do not read data from network,
>> leading timeouts in some cases
>>
>> 2. All requests created while parsing this chunk is created in one
>> connection, that slows ra_serf
>>
> Another problem is how serf reads data from network in case of
> multiple connections: it reads data from one connection until EAGAIN.
> But if data comes from network really fast (from local server for
> example) it continue reading data from this connection, without
> reading data from other connection! Which leads time out them.
>
> See outgoing.c:read_from_connection()
>
Hi Philip,

Could please test attached serf patch. This patch should fix problem
with reading one connection for long time without reading from other
connections.


-- 
Ivan Zhakov

Re: Random serf checkout failures

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.

On Tue, Nov 6, 2012 at 4:41 PM, Ivan Zhakov <iv...@visualsvn.com> wrote:
> Another problem is how serf reads data from network in case of
> multiple connections: it reads data from one connection until EAGAIN.
> But if data comes from network really fast (from local server for
> example) it continue reading data from this connection, without
> reading data from other connection! Which leads time out them.
>
> See outgoing.c:read_from_connection()

Hmm.  I'm not sure if there's an easy answer for this while keeping
ra_serf single-threaded.  Perhaps keep a track of how much we've read
in a single connection and give up reading that connection for a bit
if we've read 100KB or some such to see if there are other connections
with responses ready to read.

But, then again, here in 2012, I'd probably fight a lot harder to make
the checkout logic run across multiple network threads as that would
spread the SSL or gzip loads across multiple cores.  -- justin

Re: Random serf checkout failures

Posted by Ivan Zhakov <iv...@visualsvn.com>.

On Tue, Nov 6, 2012 at 10:24 PM, Ivan Zhakov <iv...@visualsvn.com> wrote:
> On Tue, Nov 6, 2012 at 9:13 PM, Lieven Govaerts <lg...@mobsol.be> wrote:
>> Hi,
>>
>> On Tue, Nov 6, 2012 at 4:50 PM, Lieven Govaerts <lg...@mobsol.be> wrote:
>>> Ben,
>>>
>>> On Tue, Nov 6, 2012 at 4:09 PM, Ben Reser <be...@reser.org> wrote:
>>>> I worked with Philip today and was able to reproduce the exact problem
>>>> he's been seeing.  I ended up having to get his full httpd.conf to
>>>> figure it out..
>>>>
>>>> Ultimately the problem proved to be that he had this directive:
>>>> Timeout 3
>>>>
>>>> Which would mean if we don't tend a connection for 3 seconds Apache
>>>> will close it.  Serf should be able to deal with the connection being
>>>> closed.
>>>>
>>
>> okay, so with the Timeout directive added I can reproduce this issue.
>>
>> What I see is that the server closes the connection in the middle of
>> sending a response to the client. It doesn't even finalize the
>> response first.
>> So ra_serf is reading the data from the response bucket, but gets an
>> APR_EOF when it needs more data than specified in the Content-Length
>> header of the response.
>>
>> What is the expected behavior here, let serf close down the connection
>> and try the request again on a new connection?
>>
> I think no. Timeout 3 directive means "abort the connection if client
> didn't read data within 3 seconds"
>
> So most likely reason the client is busy with doing something for a
> long time without reading data from this network connection.
>
> Probably it's related to the issue I found today:
> 1. We reading REPORT faster than complete PROPFINDs/GETs
> 2. Data from REPORT response going to spillbuf: 1mb going to memory,
> others stored to disk.
> 3. After completing some PROPFINDs/GETs serf update editor resumes
> parsing data from spillbuf: 1mb in one chunk
>
> Two problems comes here:
> 1. When parsing such big block we do not read data from network,
> leading timeouts in some cases
>
> 2. All requests created while parsing this chunk is created in one
> connection, that slows ra_serf
>
Another problem is how serf reads data from network in case of
multiple connections: it reads data from one connection until EAGAIN.
But if data comes from network really fast (from local server for
example) it continue reading data from this connection, without
reading data from other connection! Which leads time out them.

See outgoing.c:read_from_connection()


-- 
Ivan Zhakov

Re: Random serf checkout failures

Posted by Ivan Zhakov <iv...@visualsvn.com>.

On Tue, Nov 6, 2012 at 10:24 PM, Ivan Zhakov <iv...@visualsvn.com> wrote:
> On Tue, Nov 6, 2012 at 9:13 PM, Lieven Govaerts <lg...@mobsol.be> wrote:
>> Hi,
>>
>> On Tue, Nov 6, 2012 at 4:50 PM, Lieven Govaerts <lg...@mobsol.be> wrote:
>>> Ben,
>>>
>>> On Tue, Nov 6, 2012 at 4:09 PM, Ben Reser <be...@reser.org> wrote:
>>>> I worked with Philip today and was able to reproduce the exact problem
>>>> he's been seeing.  I ended up having to get his full httpd.conf to
>>>> figure it out..
>>>>
>>>> Ultimately the problem proved to be that he had this directive:
>>>> Timeout 3
>>>>
>>>> Which would mean if we don't tend a connection for 3 seconds Apache
>>>> will close it.  Serf should be able to deal with the connection being
>>>> closed.
>>>>
>>
>> okay, so with the Timeout directive added I can reproduce this issue.
>>
>> What I see is that the server closes the connection in the middle of
>> sending a response to the client. It doesn't even finalize the
>> response first.
>> So ra_serf is reading the data from the response bucket, but gets an
>> APR_EOF when it needs more data than specified in the Content-Length
>> header of the response.
>>
>> What is the expected behavior here, let serf close down the connection
>> and try the request again on a new connection?
>>
> I think no. Timeout 3 directive means "abort the connection if client
> didn't read data within 3 seconds"
>
> So most likely reason the client is busy with doing something for a
> long time without reading data from this network connection.
>
> Probably it's related to the issue I found today:
> 1. We reading REPORT faster than complete PROPFINDs/GETs
> 2. Data from REPORT response going to spillbuf: 1mb going to memory,
> others stored to disk.
> 3. After completing some PROPFINDs/GETs serf update editor resumes
> parsing data from spillbuf: 1mb in one chunk
>
> Two problems comes here:
> 1. When parsing such big block we do not read data from network,
> leading timeouts in some cases
>
> 2. All requests created while parsing this chunk is created in one
> connection, that slows ra_serf
>
There is comment in
subversion\libsvn_ra_serf\util.c:svn_ra_serf__process_pending()
[[
  /* ### it is possible that the XML parsing of the pending content is
     ### so slow, and that we don't return to reading the connection
     ### fast enough... that the server will disconnect us. right now,
     ### that is highly improbable, but is noted for future's sake.
     ### should that ever happen, the loops in this function can simply
     ### terminate after N seconds.  */
]]]


-- 
Ivan Zhakov

Re: Random serf checkout failures

Posted by Ivan Zhakov <iv...@visualsvn.com>.

On Tue, Nov 6, 2012 at 9:13 PM, Lieven Govaerts <lg...@mobsol.be> wrote:
> Hi,
>
> On Tue, Nov 6, 2012 at 4:50 PM, Lieven Govaerts <lg...@mobsol.be> wrote:
>> Ben,
>>
>> On Tue, Nov 6, 2012 at 4:09 PM, Ben Reser <be...@reser.org> wrote:
>>> I worked with Philip today and was able to reproduce the exact problem
>>> he's been seeing.  I ended up having to get his full httpd.conf to
>>> figure it out..
>>>
>>> Ultimately the problem proved to be that he had this directive:
>>> Timeout 3
>>>
>>> Which would mean if we don't tend a connection for 3 seconds Apache
>>> will close it.  Serf should be able to deal with the connection being
>>> closed.
>>>
>
> okay, so with the Timeout directive added I can reproduce this issue.
>
> What I see is that the server closes the connection in the middle of
> sending a response to the client. It doesn't even finalize the
> response first.
> So ra_serf is reading the data from the response bucket, but gets an
> APR_EOF when it needs more data than specified in the Content-Length
> header of the response.
>
> What is the expected behavior here, let serf close down the connection
> and try the request again on a new connection?
>
I think no. Timeout 3 directive means "abort the connection if client
didn't read data within 3 seconds"

So most likely reason the client is busy with doing something for a
long time without reading data from this network connection.

Probably it's related to the issue I found today:
1. We reading REPORT faster than complete PROPFINDs/GETs
2. Data from REPORT response going to spillbuf: 1mb going to memory,
others stored to disk.
3. After completing some PROPFINDs/GETs serf update editor resumes
parsing data from spillbuf: 1mb in one chunk

Two problems comes here:
1. When parsing such big block we do not read data from network,
leading timeouts in some cases

2. All requests created while parsing this chunk is created in one
connection, that slows ra_serf

-- 
Ivan Zhakov

Re: Random serf checkout failures

Posted by Philip Martin <ph...@wandisco.com>.

Philip Martin <ph...@wandisco.com> writes:

> but I still see this as well:
>
> A    wc/1.3.0-rc1/subversion/bindings/java/javahl/src/org/tigris/subversion/javahl/SVNAdmin.java
> ../src/subversion/libsvn_ra_serf/update.c:680: (apr_err=235000)
> svn: E235000: In file '../src/subversion/libsvn_ra_serf/update.c' line 680: assertion failed (! dir->ref_count)
>
> Program received signal SIGABRT, Aborted.

and this assert as well:

../src/subversion/libsvn_wc/wc_db.c:1658: (apr_err=235000)
svn: E235000: In file '../src/subversion/libsvn_wc/wc_db.c' line 1658: assertion failed (SVN_IS_VALID_REVNUM(changed_rev))

Program received signal SIGABRT, Aborted.
1658	  SVN_ERR_ASSERT(SVN_IS_VALID_REVNUM(changed_rev));
(gdb) p changed_rev
$2 = -1
(gdb) bt
#3  0x00007ffff6feb64e in svn_error__malfunction (can_return=1, 
    file=0x7ffff79393f8 "../src/subversion/libsvn_wc/wc_db.c", line=1658, 
    expr=0x7ffff79398c8 "SVN_IS_VALID_REVNUM(changed_rev)")
    at ../src/subversion/libsvn_subr/error.c:703
#4  0x00007ffff78fe498 in svn_wc__db_base_add_directory (db=0x7ffff32fd6d8, 
    local_abspath=0x7ffff1ab0198 "/home/pm/sw/subversion/obj/wc/1.3.0-rc1/subversion/tests/clients", 
    wri_abspath=0x7ffff7dff218 "/home/pm/sw/subversion/obj/wc", 
    repos_relpath=0x7ffff1ab01e0 "subversion/tags/1.3.0-rc1/subversion/tests/clients", repos_root_url=0x7ffff7e44f88 "http://peri.local:8888/repos/asf", 
    repos_uuid=0x7ffff7e44fb0 "13f79535-47bb-0310-9956-ffa450edef68", 
    revision=1406090, props=0x7ffff1ab4560, changed_rev=-1, changed_date=0, 
    changed_author=0x0, children=0x0, depth=svn_depth_infinity, dav_cache=0x0, 
    conflict=0x0, update_actual_props=0, new_actual_props=0x0, new_iprops=0x0, 
    work_items=0x0, scratch_pool=0x7ffff1ab0028)
    at ../src/subversion/libsvn_wc/wc_db.c:1658
#5  0x00007ffff78ec66e in close_directory (dir_baton=0x7ffff1ab00a0, 
    pool=0x7ffff1ab4028) at ../src/subversion/libsvn_wc/update_editor.c:2725
#6  0x00007ffff745eb6e in close_directory (dir_baton=0x7ffff1ab40a0, 
    pool=0x7ffff1ab4028) at ../src/subversion/libsvn_delta/cancel.c:261
#7  0x00007ffff5b1ec0f in close_dir (dir=0x7ffff1aae0c8)
    at ../src/subversion/libsvn_ra_serf/update.c:639
#8  0x00007ffff5b1ee22 in close_all_dirs (dir=0x7ffff1aae0c8)
    at ../src/subversion/libsvn_ra_serf/update.c:684
#9  0x00007ffff5b1ed4a in close_all_dirs (dir=0x7ffff1af40c8)
    at ../src/subversion/libsvn_ra_serf/update.c:676
#10 0x00007ffff5b1ed4a in close_all_dirs (dir=0x7ffff271a0c8)
    at ../src/subversion/libsvn_ra_serf/update.c:676
#11 0x00007ffff5b1ed4a in close_all_dirs (dir=0x7ffff2b290c8)
    at ../src/subversion/libsvn_ra_serf/update.c:676
#12 0x00007ffff5b1ed4a in close_all_dirs (dir=0x7ffff32a10c8)
    at ../src/subversion/libsvn_ra_serf/update.c:676
#13 0x00007ffff5b2420f in finish_report (report_baton=0x7ffff7e45090, 
    pool=0x7ffff7e42028) at ../src/subversion/libsvn_ra_serf/update.c:2746
#14 0x00007ffff789bc6d in svn_wc_crawl_revisions5 (wc_ctx=0x7ffff32fd6c0, 
    local_abspath=0x7ffff7e42168 "/home/pm/sw/subversion/obj/wc", 
    reporter=0x7ffff5d394e0, report_baton=0x7ffff7e45090, restore_files=1, 
    depth=svn_depth_unknown, honor_depth_exclude=1, 
    depth_compatibility_trick=0, use_commit_times=0, 
    cancel_func=0x416b43 <svn_cl__check_cancel>, cancel_baton=0x0, 
    notify_func=0x41c467 <notify>, notify_baton=0x7ffff32fd9b0, 
    scratch_pool=0x7ffff7e42028)
    at ../src/subversion/libsvn_wc/adm_crawler.c:858
#15 0x00007ffff7bc7f6d in update_internal (result_rev=0x0, 
    local_abspath=0x7ffff7e42168 "/home/pm/sw/subversion/obj/wc", 
    anchor_abspath=0x7ffff7e438a0 "/home/pm/sw/subversion/obj/wc", 
    revision=0x7fffffffe050, depth=svn_depth_unknown, depth_is_sticky=0, 
    ignore_externals=0, allow_unver_obstructions=0, adds_as_modification=1, 
    timestamp_sleep=0x7fffffffe15c, notify_summary=1, ctx=0x7ffff32fd608, 
    pool=0x7ffff7e42028) at ../src/subversion/libsvn_client/update.c:427


-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: Random serf checkout failures

Posted by Philip Martin <ph...@wandisco.com>.

Lieven Govaerts <lg...@mobsol.be> writes:

>> Philip or Ben, can you test that with this patch svn will always stop
>> with a communication error?

Using serf 1.1.x patched and Subversion 1406366 I get this new error:

A    wc/1.3.0-rc1/subversion/libsvn_subr/config.c
../src/subversion/svn/checkout-cmd.c:168: (apr_err=120105)
../src/subversion/libsvn_client/checkout.c:179: (apr_err=120105)
../src/subversion/libsvn_client/update.c:579: (apr_err=120105)
../src/subversion/libsvn_client/update.c:440: (apr_err=120105)
../src/subversion/libsvn_wc/adm_crawler.c:858: (apr_err=120105)
../src/subversion/libsvn_ra_serf/update.c:2564: (apr_err=120105)
../src/subversion/libsvn_ra_serf/util.c:2028: (apr_err=120105)
../src/subversion/libsvn_ra_serf/util.c:2009: (apr_err=120105)
../src/subversion/libsvn_ra_serf/update.c:989: (apr_err=120105)
svn: E120105: ra_serf: The server sent an improper HTTP response

but I still see this as well:

A    wc/1.3.0-rc1/subversion/bindings/java/javahl/src/org/tigris/subversion/javahl/SVNAdmin.java
../src/subversion/libsvn_ra_serf/update.c:680: (apr_err=235000)
svn: E235000: In file '../src/subversion/libsvn_ra_serf/update.c' line 680: assertion failed (! dir->ref_count)

Program received signal SIGABRT, Aborted.
0x00007ffff65ea475 in *__GI_raise (sig=<optimized out>)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64	../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) p dir->ref_count
$1 = 1
(gdb) bt
#0  0x00007ffff65ea475 in *__GI_raise (sig=<optimized out>)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007ffff65ed6f0 in *__GI_abort () at abort.c:92
#2  0x00007ffff6feb5fd in svn_error_abort_on_malfunction (can_return=1, 
    file=0x7ffff5b2ff78 "../src/subversion/libsvn_ra_serf/update.c", line=680, 
    expr=0x7ffff5b3001e "! dir->ref_count")
    at ../src/subversion/libsvn_subr/error.c:678
#3  0x00007ffff6feb64e in svn_error__malfunction (can_return=1, 
    file=0x7ffff5b2ff78 "../src/subversion/libsvn_ra_serf/update.c", line=680, 
    expr=0x7ffff5b3001e "! dir->ref_count")
    at ../src/subversion/libsvn_subr/error.c:703
#4  0x00007ffff5b1edc0 in close_all_dirs (dir=0x7ffff1e810c8)
    at ../src/subversion/libsvn_ra_serf/update.c:680
#5  0x00007ffff5b1ed4a in close_all_dirs (dir=0x7ffff1e830c8)
    at ../src/subversion/libsvn_ra_serf/update.c:676
#6  0x00007ffff5b1ed4a in close_all_dirs (dir=0x7ffff1e850c8)
    at ../src/subversion/libsvn_ra_serf/update.c:676
#7  0x00007ffff5b1ed4a in close_all_dirs (dir=0x7ffff1e870c8)
    at ../src/subversion/libsvn_ra_serf/update.c:676
#8  0x00007ffff5b1ed4a in close_all_dirs (dir=0x7ffff1e8d0c8)
    at ../src/subversion/libsvn_ra_serf/update.c:676
#9  0x00007ffff5b1ed4a in close_all_dirs (dir=0x7ffff20090c8)
    at ../src/subversion/libsvn_ra_serf/update.c:676
#10 0x00007ffff5b1ed4a in close_all_dirs (dir=0x7ffff200b0c8)
    at ../src/subversion/libsvn_ra_serf/update.c:676
#11 0x00007ffff5b1ed4a in close_all_dirs (dir=0x7ffff20130c8)
    at ../src/subversion/libsvn_ra_serf/update.c:676
#12 0x00007ffff5b1ed4a in close_all_dirs (dir=0x7ffff24b90c8)
    at ../src/subversion/libsvn_ra_serf/update.c:676
#13 0x00007ffff5b1ed4a in close_all_dirs (dir=0x7ffff28f60c8)
    at ../src/subversion/libsvn_ra_serf/update.c:676
#14 0x00007ffff5b1ed4a in close_all_dirs (dir=0x7ffff32a10c8)
    at ../src/subversion/libsvn_ra_serf/update.c:676
#15 0x00007ffff5b2420f in finish_report (report_baton=0x7ffff7e45090, 
    pool=0x7ffff7e42028) at ../src/subversion/libsvn_ra_serf/update.c:2746
#16 0x00007ffff789bc6d in svn_wc_crawl_revisions5 (wc_ctx=0x7ffff32fd6b0, 
    local_abspath=0x7ffff7e42168 "/home/pm/sw/subversion/obj/wc", 
    reporter=0x7ffff5d394e0, report_baton=0x7ffff7e45090, restore_files=1, 
    depth=svn_depth_unknown, honor_depth_exclude=1, 
    depth_compatibility_trick=0, use_commit_times=0, 
    cancel_func=0x416b43 <svn_cl__check_cancel>, cancel_baton=0x0, 
    notify_func=0x41c467 <notify>, notify_baton=0x7ffff32fd9a0, 
    scratch_pool=0x7ffff7e42028)
    at ../src/subversion/libsvn_wc/adm_crawler.c:858
#17 0x00007ffff7bc7f6d in update_internal (result_rev=0x0, 
    local_abspath=0x7ffff7e42168 "/home/pm/sw/subversion/obj/wc", 
    anchor_abspath=0x7ffff7e438a0 "/home/pm/sw/subversion/obj/wc", 
    revision=0x7fffffffe050, depth=svn_depth_unknown, depth_is_sticky=0, 
    ignore_externals=0, allow_unver_obstructions=0, adds_as_modification=1, 
    timestamp_sleep=0x7fffffffe15c, notify_summary=1, ctx=0x7ffff32fd5f8, 
    pool=0x7ffff7e42028) at ../src/subversion/libsvn_client/update.c:427


-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: Random serf checkout failures

Posted by Lieven Govaerts <lg...@mobsol.be>.

patch attached

On Tue, Nov 6, 2012 at 10:45 PM, Lieven Govaerts <lg...@mobsol.be> wrote:
> Hi,
>
> On Tue, Nov 6, 2012 at 6:13 PM, Lieven Govaerts <lg...@mobsol.be> wrote:
>> Hi,
>>
>> On Tue, Nov 6, 2012 at 4:50 PM, Lieven Govaerts <lg...@mobsol.be> wrote:
>>> Ben,
>>>
>>> On Tue, Nov 6, 2012 at 4:09 PM, Ben Reser <be...@reser.org> wrote:
>>>> I worked with Philip today and was able to reproduce the exact problem
>>>> he's been seeing.  I ended up having to get his full httpd.conf to
>>>> figure it out..
>>>>
>>>> Ultimately the problem proved to be that he had this directive:
>>>> Timeout 3
>>>>
>>>> Which would mean if we don't tend a connection for 3 seconds Apache
>>>> will close it.  Serf should be able to deal with the connection being
>>>> closed.
>>>>
>>
>> okay, so with the Timeout directive added I can reproduce this issue.
>>
>> What I see is that the server closes the connection in the middle of
>> sending a response to the client. It doesn't even finalize the
>> response first.
>> So ra_serf is reading the data from the response bucket, but gets an
>> APR_EOF when it needs more data than specified in the Content-Length
>> header of the response.
>>
>> What is the expected behavior here, let serf close down the connection
>> and try the request again on a new connection?
>
> IMHO, the reasonable thing to do here, is to report the error
> correctly to the user, so that we don't corrupt the working copy.
> This way the user can retry the transaction.
>
> For response bodies that are encoded (chunked or gzip/deflated), serf
> will already return an error status to svn/ra_serf. The one case where
> serf silently returns APR_EOF is when no encoding is used so the
> Content-Length header is set.
> Attached patch to serf trunk will calculate the actually read bytes
> from the response body, and ensures an error is returned to svn when
> this is less than what's expected. The patch is not final, I probably
> want to use another error code.
>
> Philip or Ben, can you test that with this patch svn will always stop
> with a communication error?
>
> Lieven

Re: Random serf checkout failures

Posted by Lieven Govaerts <lg...@mobsol.be>.

Hi,

On Tue, Nov 6, 2012 at 6:13 PM, Lieven Govaerts <lg...@mobsol.be> wrote:
> Hi,
>
> On Tue, Nov 6, 2012 at 4:50 PM, Lieven Govaerts <lg...@mobsol.be> wrote:
>> Ben,
>>
>> On Tue, Nov 6, 2012 at 4:09 PM, Ben Reser <be...@reser.org> wrote:
>>> I worked with Philip today and was able to reproduce the exact problem
>>> he's been seeing.  I ended up having to get his full httpd.conf to
>>> figure it out..
>>>
>>> Ultimately the problem proved to be that he had this directive:
>>> Timeout 3
>>>
>>> Which would mean if we don't tend a connection for 3 seconds Apache
>>> will close it.  Serf should be able to deal with the connection being
>>> closed.
>>>
>
> okay, so with the Timeout directive added I can reproduce this issue.
>
> What I see is that the server closes the connection in the middle of
> sending a response to the client. It doesn't even finalize the
> response first.
> So ra_serf is reading the data from the response bucket, but gets an
> APR_EOF when it needs more data than specified in the Content-Length
> header of the response.
>
> What is the expected behavior here, let serf close down the connection
> and try the request again on a new connection?

IMHO, the reasonable thing to do here, is to report the error
correctly to the user, so that we don't corrupt the working copy.
This way the user can retry the transaction.

For response bodies that are encoded (chunked or gzip/deflated), serf
will already return an error status to svn/ra_serf. The one case where
serf silently returns APR_EOF is when no encoding is used so the
Content-Length header is set.
Attached patch to serf trunk will calculate the actually read bytes
from the response body, and ensures an error is returned to svn when
this is less than what's expected. The patch is not final, I probably
want to use another error code.

Philip or Ben, can you test that with this patch svn will always stop
with a communication error?

Lieven

Re: Random serf checkout failures

Posted by Philip Martin <ph...@wandisco.com>.

Justin Erenkrantz <ju...@erenkrantz.com> writes:

> Sure sounds like an early TCP close causing a lost response.  httpd
> shouldn't be closing the TCP connection in the middle of the response
> as that's not when httpd would evaluate the Timeout directive...but,
> if you are going over a loopback connection, I think I've seen Windows
> lose bytes...but, I don't think Philip uses Windows...weird - I'd
> guess something else is going on.  -- justin

I'm using Linux and I see the problems over my LAN as well as localhost.

-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: Random serf checkout failures

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.

On Tue, Nov 6, 2012 at 12:13 PM, Lieven Govaerts <lg...@mobsol.be> wrote:
> okay, so with the Timeout directive added I can reproduce this issue.
>
> What I see is that the server closes the connection in the middle of
> sending a response to the client. It doesn't even finalize the
> response first.
> So ra_serf is reading the data from the response bucket, but gets an
> APR_EOF when it needs more data than specified in the Content-Length
> header of the response.
>
> What is the expected behavior here, let serf close down the connection
> and try the request again on a new connection?

Sure sounds like an early TCP close causing a lost response.  httpd
shouldn't be closing the TCP connection in the middle of the response
as that's not when httpd would evaluate the Timeout directive...but,
if you are going over a loopback connection, I think I've seen Windows
lose bytes...but, I don't think Philip uses Windows...weird - I'd
guess something else is going on.  -- justin

Re: Random serf checkout failures

Posted by Lieven Govaerts <lg...@mobsol.be>.

Hi,

On Tue, Nov 6, 2012 at 4:50 PM, Lieven Govaerts <lg...@mobsol.be> wrote:
> Ben,
>
> On Tue, Nov 6, 2012 at 4:09 PM, Ben Reser <be...@reser.org> wrote:
>> I worked with Philip today and was able to reproduce the exact problem
>> he's been seeing.  I ended up having to get his full httpd.conf to
>> figure it out..
>>
>> Ultimately the problem proved to be that he had this directive:
>> Timeout 3
>>
>> Which would mean if we don't tend a connection for 3 seconds Apache
>> will close it.  Serf should be able to deal with the connection being
>> closed.
>>

okay, so with the Timeout directive added I can reproduce this issue.

What I see is that the server closes the connection in the middle of
sending a response to the client. It doesn't even finalize the
response first.
So ra_serf is reading the data from the response bucket, but gets an
APR_EOF when it needs more data than specified in the Content-Length
header of the response.

What is the expected behavior here, let serf close down the connection
and try the request again on a new connection?

[..]

Lieven

Re: Random serf checkout failures

Posted by Lieven Govaerts <lg...@mobsol.be>.

Ben,

On Tue, Nov 6, 2012 at 4:09 PM, Ben Reser <be...@reser.org> wrote:
> I worked with Philip today and was able to reproduce the exact problem
> he's been seeing.  I ended up having to get his full httpd.conf to
> figure it out..
>
> Ultimately the problem proved to be that he had this directive:
> Timeout 3
>
> Which would mean if we don't tend a connection for 3 seconds Apache
> will close it.  Serf should be able to deal with the connection being
> closed.
>
> So we'll need to do some further investigation.  However, it should be
> possible to replicate this easier now by setting a lower Timeout.

Thanks for that extra info. I tried yesterday evening with both svn
trunk and mod_dav_svn trunk but for me everything worked ok.

I'm finalizing a Kerberos patch for serf now, as soon as that's done
I'll try the "Timeout 3" config change.

[..]

Lieven

Re: Random serf checkout failures

Posted by Ben Reser <be...@reser.org>.

I worked with Philip today and was able to reproduce the exact problem
he's been seeing.  I ended up having to get his full httpd.conf to
figure it out..

Ultimately the problem proved to be that he had this directive:
Timeout 3

Which would mean if we don't tend a connection for 3 seconds Apache
will close it.  Serf should be able to deal with the connection being
closed.

So we'll need to do some further investigation.  However, it should be
possible to replicate this easier now by setting a lower Timeout.

On Mon, Nov 5, 2012 at 5:17 PM, Philip Martin
<ph...@wandisco.com> wrote:
> Lieven Govaerts <lg...@mobsol.be> writes:
>
>> From your previous report I noticed the 406 response that serf isn't
>> handling, at least serf doesn't support a response coming in while the
>> matching request wasn't completely sent yet. I guess that 406 response
>> was for a REPORT request not flushed fast enough.
>
> In case it makes a difference I'm using the worker MPM configured:
>
>     KeepAlive On
>     StartServers          1
>     MaxClients            8
>     MinSpareThreads       2
>     MaxSpareThreads       8
>     ThreadsPerChild       8
>     MaxRequestsPerChild   0
>     MaxKeepAliveRequests 100
>
> so all requests are handled by threads in the same process.
>
> --
> Certified & Supported Apache Subversion Downloads:
> http://www.wandisco.com/subversion/download

Re: Random serf checkout failures

Posted by Philip Martin <ph...@wandisco.com>.

Lieven Govaerts <lg...@mobsol.be> writes:

> From your previous report I noticed the 406 response that serf isn't
> handling, at least serf doesn't support a response coming in while the
> matching request wasn't completely sent yet. I guess that 406 response
> was for a REPORT request not flushed fast enough.

In case it makes a difference I'm using the worker MPM configured:

    KeepAlive On
    StartServers          1
    MaxClients            8
    MinSpareThreads       2
    MaxSpareThreads       8
    ThreadsPerChild       8
    MaxRequestsPerChild   0
    MaxKeepAliveRequests 100

so all requests are handled by threads in the same process.

-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: Random serf checkout failures

Posted by Lieven Govaerts <lg...@mobsol.be>.

On Mon, Nov 5, 2012 at 11:33 AM, Philip Martin
<ph...@wandisco.com> wrote:
> Philip Martin <ph...@codematters.co.uk> writes:
>
>> I've been experimenting checking out subversion/tags from my local
>> mirror of the Subversion repository using a client and mod_dav_svn from
>> trunk.  I'm getting several different errors seemingly at random, I
>> suspect they may be related to some common underlying problem.
>
> I can reproduce this using another dataset that should easier for others
> to reproduce: import trunk 150 times making 150 random file changes each
> time.
>
> svn export http://svn.apache.org/repos/asf/subversion/trunk trunk
> svnadmin create repo
> svn mkdir -mm file://`pwd`/repo/tags
> for t in `seq 1 150` ; do
>   for f in `find trunk -type f | sort -R | head -150` ; do
>     echo xx$t >> $f
>   done
>   svn import -mm trunk file://`pwd`/repo/tags/$t
> done
>

Alright thanks, I'll try this at the hackathon this afternoon.

> Another checkout error to add to list of ways it can fail:
>
> A    wc/86/subversion/tests/cmdline/import_tests_data/import_tree/DIR1.noo
> A    wc/86/subversion/tests/cmdline/import_tests_data/import_tree/DIR6/DIR7/DIR8.noo
> A    wc/86/subversion/bindings/swig/proxy
> ../src/subversion/svn/checkout-cmd.c:168: (apr_err=175009)
> ../src/subversion/libsvn_client/checkout.c:179: (apr_err=175009)
> ../src/subversion/libsvn_client/update.c:579: (apr_err=175009)
> ../src/subversion/libsvn_client/update.c:440: (apr_err=175009)
> ../src/subversion/libsvn_wc/adm_crawler.c:858: (apr_err=175009)
> ../src/subversion/libsvn_ra_serf/update.c:2676: (apr_err=175009)
> ../src/subversion/libsvn_ra_serf/update.c:2672: (apr_err=175009)
> svn: E175009: Missing update-report close tag
>

>From your previous report I noticed the 406 response that serf isn't
handling, at least serf doesn't support a response coming in while the
matching request wasn't completely sent yet. I guess that 406 response
was for a REPORT request not flushed fast enough.

Lieven

Re: Random serf checkout failures

Posted by Philip Martin <ph...@wandisco.com>.

Philip Martin <ph...@codematters.co.uk> writes:

> I've been experimenting checking out subversion/tags from my local
> mirror of the Subversion repository using a client and mod_dav_svn from
> trunk.  I'm getting several different errors seemingly at random, I
> suspect they may be related to some common underlying problem.

I can reproduce this using another dataset that should easier for others
to reproduce: import trunk 150 times making 150 random file changes each
time.

svn export http://svn.apache.org/repos/asf/subversion/trunk trunk
svnadmin create repo
svn mkdir -mm file://`pwd`/repo/tags
for t in `seq 1 150` ; do
  for f in `find trunk -type f | sort -R | head -150` ; do
    echo xx$t >> $f
  done
  svn import -mm trunk file://`pwd`/repo/tags/$t
done

Another checkout error to add to list of ways it can fail:

A    wc/86/subversion/tests/cmdline/import_tests_data/import_tree/DIR1.noo
A    wc/86/subversion/tests/cmdline/import_tests_data/import_tree/DIR6/DIR7/DIR8.noo
A    wc/86/subversion/bindings/swig/proxy
../src/subversion/svn/checkout-cmd.c:168: (apr_err=175009)
../src/subversion/libsvn_client/checkout.c:179: (apr_err=175009)
../src/subversion/libsvn_client/update.c:579: (apr_err=175009)
../src/subversion/libsvn_client/update.c:440: (apr_err=175009)
../src/subversion/libsvn_wc/adm_crawler.c:858: (apr_err=175009)
../src/subversion/libsvn_ra_serf/update.c:2676: (apr_err=175009)
../src/subversion/libsvn_ra_serf/update.c:2672: (apr_err=175009)
svn: E175009: Missing update-report close tag

-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download