You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Erik Huelsmann <eh...@gmail.com> on 2009/05/27 09:38:48 UTC

Re: [Issue 3347] svn never times out when (public) IP changes

On Wed, May 27, 2009 at 1:28 AM, Stefan Sperling <st...@elego.de> wrote:
> http://subversion.tigris.org/issues/show_bug.cgi?id=3347
>
>
>
> User stsp changed the following:
>
>                What    |Old value                 |New value
> ================================================================================
>                  Status|NEW                       |RESOLVED
> --------------------------------------------------------------------------------
>              Resolution|                          |FIXED
> --------------------------------------------------------------------------------
>
>
>
>
> ------- Additional comments from stsp@tigris.org Tue May 26 16:27:59 -0700 2009 -------
> I've finally looked at this again, and there was indeed code which
> explicitly set the timeout for a socket to infinity when reading
> data from a socket used for the svn:// protocol. Very bad.

Actually, that was by design. Please ask Greg Hudson why, but I know
he's been referring to timeouts as used by HTTP being a real pain in
the butt for serious application development.

Bye,

Erik.

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2355694


Re: [Issue 3347] svn never times out when (public) IP changes

Posted by Stefan Sperling <st...@elego.de>.
On Fri, May 29, 2009 at 12:02:26AM -0400, Greg Hudson wrote:
> Looks good to me.

Thanks, r37890.

Stefan

Re: [Issue 3347] svn never times out when (public) IP changes

Posted by Greg Hudson <gh...@mit.edu>.
Looks good to me.

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2356653

Re: [Issue 3347] svn never times out when (public) IP changes

Posted by Stefan Sperling <st...@elego.de>.
On Wed, May 27, 2009 at 10:34:00PM +0100, Stefan Sperling wrote:
> Right. So we would be using keep-alives as an out-of-band mechanism,
> in the sense that we ask TCP to please tell us if we get a reset from
> the other side when sending non-svn-protocol messages. The svn protocol
> itself is then isolated from having to deal with this network-layer
> problem.
> 
> I see your point and will try to find a way to do this.

Here's a patch.

I've tested it by:

1) Configuring my kernel to start sending keep-alives after 20 seconds
   instead of 2 hours (the default).

2) Running a large checkout over svn:// from a remote server.

3) Causing the connection to stall by making my router drop any packets
   sent from the server to the client. This simulates the situation were
   the client is no longer able to receive traffic from the server because
   its connection to the public internet has gone down.

4) Waiting for some time.

5) Inserting another firewall rule which sends a TCP RST packet for any
   TCP packet going to the server. This simulates the situation were the
   client's public internet connection comes back, but with a different
   IP address assigned to it by the ISP, so that the server will reply
   with a RST because it does not recognize the client anymore.

6) All the way through, monitoring the traffic using tcpdump, making
   sure that there is no traffic generated by the client itself,
   because this would cause keep-alives not to be sent. Also, making
   sure that the kernel did send keep-alives after 20 seconds. On Linux,
   these seem to show up as a succession of a couple of ACKs for the
   same segment sent from the client to the server. At least that is
   what I saw and it is sufficiently different from the other traffic
   to signify that keep-alives were in fact being used.

Without the patch, svn just sat there waiting seemingly forever,
even after the RST firewall rule was inserted.

With the patch, svn threw an error after the quick succession of
ACKS appearing in the tcpdump output:

subversion/svn/checkout-cmd.c:168: (apr_err=104)
subversion/libsvn_client/checkout.c:224: (apr_err=104)
subversion/libsvn_client/update.c:275: (apr_err=104)
subversion/libsvn_ra_svn/client.c:281: (apr_err=104)
subversion/libsvn_ra_svn/editorp.c:877: (apr_err=104)
subversion/libsvn_ra_svn/marshal.c:793: (apr_err=104)
subversion/libsvn_ra_svn/marshal.c:644: (apr_err=104)
subversion/libsvn_ra_svn/marshal.c:644: (apr_err=104)
subversion/libsvn_ra_svn/marshal.c:644: (apr_err=104)
subversion/libsvn_ra_svn/marshal.c:610: (apr_err=104)
subversion/libsvn_ra_svn/marshal.c:297: (apr_err=104)
subversion/libsvn_ra_svn/marshal.c:287: (apr_err=104)
subversion/libsvn_ra_svn/marshal.c:262: (apr_err=104)
subversion/libsvn_ra_svn/streams.c:152: (apr_err=104)
svn: Can't read from connection: Connection reset by peer

The error originates at the call to apr_socket_recv().

Do you agree that this patch is OK and addresses issue #3347 correctly?

Thanks,
Stefan

[[[
Fix issue #3347, "svn never times out when (public) IP changes"

* subversion/libsvn_ra_svn/client.c
  (make_connection): After creating the socket, try to enable TCP
   keep-alives on it so that we have a chance of detecting connection
   problems to the server while we are blocking indefinitely reading
   from the socket.

Suggested by: ghudson
]]]

Index: subversion/libsvn_ra_svn/client.c
===================================================================
--- subversion/libsvn_ra_svn/client.c	(revision 37867)
+++ subversion/libsvn_ra_svn/client.c	(working copy)
@@ -161,6 +161,21 @@ static svn_error_t *make_connection(const char *ho
     return svn_error_wrap_apr(status, _("Can't connect to host '%s'"),
                               hostname);
 
+  /* Enable TCP keep-alives on the socket so we time out when
+   * the connection breaks due to network-layer problems.
+   * If the peer has dropped the connection due to a network partition
+   * or a crash, or if the peer no longer considers the connection
+   * valid because we are behind a NAT and our public IP has changed,
+   * it will respond to the keep-alive probe with a RST instead of an
+   * acknowledgment segment, which will cause svn to abort the session
+   * even while it is currently blocked waiting for data from the peer.
+   * See issue #3347. */
+  status = apr_socket_opt_set(*sock, APR_SO_KEEPALIVE, 1);
+  if (status)
+    {
+      /* It's not a fatal error if we cannot enable keep-alives. */
+    }
+
   return SVN_NO_ERROR;
 }
 

Re: [Issue 3347] svn never times out when (public) IP changes

Posted by Stefan Sperling <st...@elego.de>.
On Wed, May 27, 2009 at 04:24:24PM -0400, Greg Hudson wrote:
> On Wed, 2009-05-27 at 20:23 +0100, Stefan Sperling wrote:
> >    A TCP keep-alive mechanism should only be invoked in
> >    server applications that might otherwise hang
> >    indefinitely and consume resources unnecessarily if a
> >    client crashes or aborts a connection during a network
> >    failure.
> 
> That RFC is paranoid about saturating links with unnecessary keepalive
> requests, which is an outdated concern.
> 
> I'm confident that every TCP implementation supports keepalives.
> 
> > Still, could you explain more precisely why you would not want us
> > to timeout after some predefined amount of time?
> 
> Just because we don't have a bound on how long each operation should
> take.  If the other side is present and responding to keepalives, then
> either (1) it's still working, or (2) we have a bug and the protocol got
> deadlocked.  In the first case, we shouldn't time out; in the second
> case, timeouts are a poor workaround for bugs.

Right. So we would be using keep-alives as an out-of-band mechanism,
in the sense that we ask TCP to please tell us if we get a reset from
the other side when sending non-svn-protocol messages. The svn protocol
itself is then isolated from having to deal with this network-layer
problem.

I see your point and will try to find a way to do this.

Thanks,
Stefan

Re: [Issue 3347] svn never times out when (public) IP changes

Posted by Greg Hudson <gh...@mit.edu>.
On Wed, 2009-05-27 at 20:23 +0100, Stefan Sperling wrote:
>    A TCP keep-alive mechanism should only be invoked in
>    server applications that might otherwise hang
>    indefinitely and consume resources unnecessarily if a
>    client crashes or aborts a connection during a network
>    failure.

That RFC is paranoid about saturating links with unnecessary keepalive
requests, which is an outdated concern.

I'm confident that every TCP implementation supports keepalives.

> Still, could you explain more precisely why you would not want us
> to timeout after some predefined amount of time?

Just because we don't have a bound on how long each operation should
take.  If the other side is present and responding to keepalives, then
either (1) it's still working, or (2) we have a bug and the protocol got
deadlocked.  In the first case, we shouldn't time out; in the second
case, timeouts are a poor workaround for bugs.

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2355926

Re: [Issue 3347] svn never times out when (public) IP changes

Posted by Stefan Sperling <st...@elego.de>.
On Wed, May 27, 2009 at 12:50:59PM -0400, Greg Hudson wrote:
> As a matter of history, I've not been impressed with application-level
> hardcoded timeouts, so I didn't put any into ra_svn.  It's not the case,
> as you said in #3347, that I explicitly removed some default timeout
> present in APR or the operating system; those layers don't have default
> timeouts.
> 
> What I would recommend to address #3347 is setting the SO_KEEPALIVE flag
> on the socket, which will cause the underlying TCP layer to use
> keepalive packets to eventually notice when the other half of the
> network connection has disappeared without sending a reset.  I'm not
> sure how to do that through APR, but it's probably not too difficult.

OK, so I've tried to educate myself a bit.

I'm not really sure whether we should use TCP keep-alives.
They're an extension to TCP, though I'd suspect most operating systems
would support them these days. But RFC1122 states:

   A TCP keep-alive mechanism should only be invoked in
   server applications that might otherwise hang
   indefinitely and consume resources unnecessarily if a
   client crashes or aborts a connection during a network
   failure.

This does not apply to our sitation. We're not a server, we're
a client behind a NAT. Granted, RFC1122 dates from 1989 so
NAT wasn't even invented yet :) RFC1631 was written in 1994.

Still, could you explain more precisely why you would not want us
to timeout after some predefined amount of time? Because that is
really easy to do and is guaranteed to work on any TCP implementation,
even if it does not feature keep-alives. If we can agree on a reasonable
timeout which is likely to indicate problems, say, 2 hours, why not
just do it? I can't think of any reason why an svn connection stalled
for 2 hours or more should not have to be restarted.

Thanks,
Stefan

Re: [Issue 3347] svn never times out when (public) IP changes

Posted by Greg Hudson <gh...@mit.edu>.
On Wed, 2009-05-27 at 18:20 +0100, Stefan Sperling wrote:
> Please let me know
> if you can find anything in it that indicates that the code in question
> was moved there from elsewhere, or even that it was written by you.

It's in these two statements:

> Encapsulate ra_svn's I/O with a stream-based wrapper.
[...]
>   (svn_ra_svn__set_block_handler, svn_ra_svn__input_waiting,
>    writebuf_output, readbuf_input): Use the new svn_ra_svn__stream_t interface
>    instead of the old apr_file_t/apr_socket_t code.

I realize it's a lot easier for me to interpret these than it is for
you, and if you made an attempt to investigate but ran up against a
wall, that's probably fine.

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2355844

Re: [Issue 3347] svn never times out when (public) IP changes

Posted by Stefan Sperling <st...@elego.de>.
On Wed, May 27, 2009 at 12:50:59PM -0400, Greg Hudson wrote:
> I looked over the code, and I believe you seriously broke things in
> r37838, without improving anything related to the issue.

Ooops! Too bad the regression tests we have didn't catch this... :(

> The default timeout for reading from an APR (or Unix) socket is
> infinity.  During pipelining, we set the timeout to 0 for some write
> opreations so that we can try them without blocking.  If APR had
> separate timeouts for read and write, I would have only set the write
> timeout in the first place, and that code in sock_read_cb would be
> unnecessary.  The comment could perhaps have explained this better, but
> I'm not sure if it could have explained it well enough to dispel your
> assumptions about what the code was doing.

Right, that clears up the mystery. Thank you!
 
> What you did was remove the code which reverts back to blocking, leaving
> the timeout at 0 in certain situations.  I don't know exactly what
> symptoms this would cause, but I imagine they're nothing good.

Yeah that's bad. I'll revert.

> As a matter of history, I've not been impressed with application-level
> hardcoded timeouts, so I didn't put any into ra_svn.  It's not the case,
> as you said in #3347, that I explicitly removed some default timeout
> present in APR or the operating system; those layers don't have default
> timeouts.
> 
> What I would recommend to address #3347 is setting the SO_KEEPALIVE flag
> on the socket, which will cause the underlying TCP layer to use
> keepalive packets to eventually notice when the other half of the
> network connection has disappeared without sending a reset.  I'm not
> sure how to do that through APR, but it's probably not too difficult.

Thanks for the hint. I'll try looking into this, too see if it is
possible to fix this properly.

> As a matter of process, you should never assume that a piece of code is
> present for no reason when you have the means to investigate why it's
> there.  Using svn blame, you could have (after tracing back through the
> code movement in r22238) seen that the code was introduced in r7532, and
> correctly interpreted it as part of the pipelining support.

Well, I used blame, and log, to try to find out what the heck was going on.

I ended up assuming that the code was introduced as part of streams.c when
it was created. Below is the log message for r22238. Please let me know
if you can find anything in it that indicates that the code in question
was moved there from elsewhere, or even that it was written by you.
Because I cannot find anything. I only found out that you wrote the
code because Erik pointed it out in this thread.

The only thing I didn't look at was the diff of that revision.
That may of course have revealed some information.
But if someone just says "New file" in a log message without giving
further information, then I can rightly assume that the file is in
fact new and does not contain anything from elsewhere.
Moving symbols across files is not "New file". Part of moving symbols
is mentioning their old location.

So, sorry about that. I tried my best to follow process, but sometimes
that alone is just not enough. I'm glad you wrote your email, because
otherwise I would have had to find out the hard way that my "fix" was
not a fix at all.

------------------------------------------------------------------------
r22238 | rooneg | 2006-11-08 23:07:54 +0000 (Wed, 08 Nov 2006) | 30 lines

Encapsulate ra_svn's I/O with a stream-based wrapper. This will
facilitate the introduction of SASL and TLS encryption.

Patch by: Vlad Georgescu <vg...@gmail.com>

* subversion/libsvn_ra_svn/marshal.c:
  Update the copyright date.
  (svn_ra_svn_create_conn): Create the connection stream. Don't initialize
   in_file and out_file.
  (svn_ra_svn__set_block_handler, svn_ra_svn__input_waiting,
   writebuf_output, readbuf_input): Use the new svn_ra_svn__stream_t interface
   instead of the old apr_file_t/apr_socket_t code.

* subversion/libsvn_ra_svn/ra_svn.h
  (ra_svn_pending_fn_t,
   ra_svn_timeout_fn_t,
   svn_ra_svn__stream_t): New typedefs.
  (svn_ra_svn_conn_st): Add stream. Remove in_file, out_file and proc. Explain
   that direct access to sock is still required by SASL.
  (svn_ra_svn__stream_from_sock,
   svn_ra_svn__stream_from_files,
   svn_ra_svn__stream_create,
   svn_ra_svn__stream_write,
   svn_ra_svn__stream_read,
   svn_ra_svn__stream_timeout,
   svn_ra_svn__stream_pending): New function declarations.

* subversion/libsvn_ra_svn/streams.c: New file. Implements the
  svn_ra_svn__stream_t interface for socket and file streams.

------------------------------------------------------------------------

Stefan

Re: [Issue 3347] svn never times out when (public) IP changes

Posted by Greg Hudson <gh...@mit.edu>.
I looked over the code, and I believe you seriously broke things in
r37838, without improving anything related to the issue.

The default timeout for reading from an APR (or Unix) socket is
infinity.  During pipelining, we set the timeout to 0 for some write
opreations so that we can try them without blocking.  If APR had
separate timeouts for read and write, I would have only set the write
timeout in the first place, and that code in sock_read_cb would be
unnecessary.  The comment could perhaps have explained this better, but
I'm not sure if it could have explained it well enough to dispel your
assumptions about what the code was doing.

What you did was remove the code which reverts back to blocking, leaving
the timeout at 0 in certain situations.  I don't know exactly what
symptoms this would cause, but I imagine they're nothing good.

As a matter of history, I've not been impressed with application-level
hardcoded timeouts, so I didn't put any into ra_svn.  It's not the case,
as you said in #3347, that I explicitly removed some default timeout
present in APR or the operating system; those layers don't have default
timeouts.

What I would recommend to address #3347 is setting the SO_KEEPALIVE flag
on the socket, which will cause the underlying TCP layer to use
keepalive packets to eventually notice when the other half of the
network connection has disappeared without sending a reset.  I'm not
sure how to do that through APR, but it's probably not too difficult.

As a matter of process, you should never assume that a piece of code is
present for no reason when you have the means to investigate why it's
there.  Using svn blame, you could have (after tracing back through the
code movement in r22238) seen that the code was introduced in r7532, and
correctly interpreted it as part of the pipelining support.

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2355831

Re: [Issue 3347] svn never times out when (public) IP changes

Posted by Stefan Sperling <st...@elego.de>.
On Wed, May 27, 2009 at 02:44:53PM +0200, Erik Huelsmann wrote:
> On Wed, May 27, 2009 at 2:41 PM, Stefan Sperling <st...@elego.de> wrote:
> > On Wed, May 27, 2009 at 11:38:48AM +0200, Erik Huelsmann wrote:
> >> On Wed, May 27, 2009 at 1:28 AM, Stefan Sperling <st...@elego.de> wrote:
> >> > http://subversion.tigris.org/issues/show_bug.cgi?id=3347
> >> >
> >> >
> >> >
> >> > User stsp changed the following:
> >> >
> >> >                What    |Old value                 |New value
> >> > ================================================================================
> >> >                  Status|NEW                       |RESOLVED
> >> > --------------------------------------------------------------------------------
> >> >              Resolution|                          |FIXED
> >> > --------------------------------------------------------------------------------
> >> >
> >> >
> >> >
> >> >
> >> > ------- Additional comments from stsp@tigris.org Tue May 26 16:27:59 -0700 2009 -------
> >> > I've finally looked at this again, and there was indeed code which
> >> > explicitly set the timeout for a socket to infinity when reading
> >> > data from a socket used for the svn:// protocol. Very bad.
> >>
> >> Actually, that was by design. Please ask Greg Hudson why, but I know
> >> he's been referring to timeouts as used by HTTP being a real pain in
> >> the butt for serious application development.
> >
> > Hmm... I've only changed ra_svn.
> > Does that have anything to do with HTTP?
> 
> Yes: HTTP already had a time out like that. ra_svn explicitly didn't.
> So, now you changed things to be like HTTP - a situation Greg
> explicitly excluded in his original code. That's all I meant to say.

Ah, OK.

Well, we can set the timeout to anything we want.

But I'd say an infinate timeout is just not a good idea.
We can set it to 6 hours or so for all I care, but infinity is just
silly because it creates problems like issue #3347.

Currently the default timeout is used, and I don't even know what
that timeout is. But all tests pass over ra_svn, so we'll need more
specific information as to what the timeout requirements for ra_svn
are, and why.

A better comment in the original code would have helped a lot, btw.
It said /* Always block on read. */. Well, reading the code gives
you the same information as the comment. But I could not find any
information as to *why* this was done.

Stefan


Re: [Issue 3347] svn never times out when (public) IP changes

Posted by Erik Huelsmann <eh...@gmail.com>.
On Wed, May 27, 2009 at 2:41 PM, Stefan Sperling <st...@elego.de> wrote:
> On Wed, May 27, 2009 at 11:38:48AM +0200, Erik Huelsmann wrote:
>> On Wed, May 27, 2009 at 1:28 AM, Stefan Sperling <st...@elego.de> wrote:
>> > http://subversion.tigris.org/issues/show_bug.cgi?id=3347
>> >
>> >
>> >
>> > User stsp changed the following:
>> >
>> >                What    |Old value                 |New value
>> > ================================================================================
>> >                  Status|NEW                       |RESOLVED
>> > --------------------------------------------------------------------------------
>> >              Resolution|                          |FIXED
>> > --------------------------------------------------------------------------------
>> >
>> >
>> >
>> >
>> > ------- Additional comments from stsp@tigris.org Tue May 26 16:27:59 -0700 2009 -------
>> > I've finally looked at this again, and there was indeed code which
>> > explicitly set the timeout for a socket to infinity when reading
>> > data from a socket used for the svn:// protocol. Very bad.
>>
>> Actually, that was by design. Please ask Greg Hudson why, but I know
>> he's been referring to timeouts as used by HTTP being a real pain in
>> the butt for serious application development.
>
> Hmm... I've only changed ra_svn.
> Does that have anything to do with HTTP?

Yes: HTTP already had a time out like that. ra_svn explicitly didn't.
So, now you changed things to be like HTTP - a situation Greg
explicitly excluded in his original code. That's all I meant to say.

Bye,

Erik.

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2355752


Re: [Issue 3347] svn never times out when (public) IP changes

Posted by Stefan Sperling <st...@elego.de>.
On Wed, May 27, 2009 at 11:38:48AM +0200, Erik Huelsmann wrote:
> On Wed, May 27, 2009 at 1:28 AM, Stefan Sperling <st...@elego.de> wrote:
> > http://subversion.tigris.org/issues/show_bug.cgi?id=3347
> >
> >
> >
> > User stsp changed the following:
> >
> >                What    |Old value                 |New value
> > ================================================================================
> >                  Status|NEW                       |RESOLVED
> > --------------------------------------------------------------------------------
> >              Resolution|                          |FIXED
> > --------------------------------------------------------------------------------
> >
> >
> >
> >
> > ------- Additional comments from stsp@tigris.org Tue May 26 16:27:59 -0700 2009 -------
> > I've finally looked at this again, and there was indeed code which
> > explicitly set the timeout for a socket to infinity when reading
> > data from a socket used for the svn:// protocol. Very bad.
> 
> Actually, that was by design. Please ask Greg Hudson why, but I know
> he's been referring to timeouts as used by HTTP being a real pain in
> the butt for serious application development.

Hmm... I've only changed ra_svn.
Does that have anything to do with HTTP?

Stefan