Posted to dev@subversion.apache.org by Bob Denny <rd...@dc3.com> on 2009/10/15 16:38:23 UTC

Issue #2580 revisited: Windows unclean TCP close [SEVERE]

SVN 1.6.5 on Windows, using svn+ssh, fails to properly complete and close the TCP connection, leaving the remote sshd/svnserve processes running (CentOS Linux server). These last "forever", accumulating until the sysop disables your SSH access :-) This is a severe problem. It doesn't always appear on slower CPUs with fast connections to the remote svn+ssh repo. However, my system is a 2.6 GHz quad-core, and the remote repo is across the public Internet. As you'll see, this is the worst-case environment. I am certain this problem is not unique to me.

I'm looking for a "buddy" on this. I am a TortoiseSVN user. TortoiseSVN shares much code with Subversion. I have checked out the latest TortoiseSVN trunk (which includes Subversion 1.6.5 as an external) and built it. I hope I won't have to go through this again for the Subversion sources in order to achieve the credibility I need here...

It is definitely an issue with Subversion, specifically libsvn_ra_svn and its use of APR. As of SVN 1.6.5, the svn client forcibly kills the tunnel subprocess as soon as it receives the last of the data that it expects. This prevents the tunnel proc from completing its conversation with the remote sshd, leaving it and its child svnserve hanging there for up to a half hour. In my case the tunnel is TortoisePLink, but it also happens with the OpenSSH 'ssh.exe' tunnel, for the same reason.

Specifically, in libsvn_ra_svn\client.c, a call is made

apr_pool_note_subprocess(pool, proc, APR_KILL_ONLY_ONCE);

Inspection of the APR sources reveals that, on Windows, the only significant kill modes are APR_KILL_ALWAYS and APR_KILL_NEVER. Any other mode (e.g. APR_KILL_ONLY_ONCE) is translated on Windows to APR_KILL_ALWAYS, resulting in the svn client doing an immediate TerminateProcess() on the tunnel, preventing it from cleanly closing the TCP connection.

Here is the tail of a packet trace showing the unclean close:

43 14.439453 svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
44 14.439453 bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4206 ...
45 14.529297 svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
46 14.529297 bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4206 ...
47 14.611328 svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
48 14.613281 bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4206 ...
49 14.693359 svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
50 14.714844 bob.dc3.com svn.dc3.com TCP:Flags=...A.R.., SrcPort=4206 ...

The 'R' flag means a reset packet. The remote sshd tried to send an ACK to me, but my tunnel had already been killed, so my TCP stack sent back this reset packet saying "forget it, this TCP connection is gone". sshd goes into bozo mode and leaves its svnserve subprocess running.

A tail of the tunnel packet log (from an 'svn ls' command) looks like this:

  00000210  29 20 28 20 37 3a 69 6e 73 74 72 65 67 20 64 69  ) ( 7:instreg di
  00000220  72 20 30 20 66 61 6c 73 65 20 30 20 28 20 32 37  r 0 false 0 ( 27
  00000230  3a 31 39 37 30 2d 30 31 2d 30 31 54 30 30 3a 30  :1970-01-01T00:0
  00000240  30 3a 30 30 2e 30 30 30 30 30 30 5a 20 29 20 28  0:00.000000Z ) (
  00000250  20 29 20 29 20 28 20 37 3a 63 6c 61 70 61 63 6b   ) ) ( 7:clapack
  00000260  20 64 69 72 20 30 20 66 61 6c 73 65 20 30 20 28   dir 0 false 0 (
  00000270  20 32 37 3a 31 39 37 30 2d 30 31 2d 30 31 54 30   27:1970-01-01T0
  00000280  30 3a 30 30 3a 30 30 2e 30 30 30 30 30 30 5a 20  0:00:00.000000Z 
  00000290  29 20 28 20 29 20 29 20 29 20 29 20 29 20        ) ( ) ) ) ) ) 
Outgoing packet type 96 / 0x60 (SSH2_MSG_CHANNEL_EOF)
  00000000  00 00 00 00                                      ....

I changed the line of code in client.c to

apr_pool_note_subprocess(pool, proc, APR_KILL_NEVER);

and the problem is cured. Here is the tail of a packet trace showing the proper TCP close:

42 13.656250  svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
43 13.656250  bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4197 ...
44 13.734375  svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
45 13.734375  bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4197 ...
46 13.812500  svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
47 13.828125  bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4197 ...
48 13.906250  svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
49 13.906250  bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4197 ...
50 13.906250  bob.dc3.com svn.dc3.com TCP:Flags=...A...F, SrcPort=4197 ...
51 13.984375  svn.dc3.com bob.dc3.com TCP:Flags=...A...., SrcPort=7822 ...
52 13.984375  svn.dc3.com bob.dc3.com TCP:Flags=...A...., SrcPort=7822 ...
53 13.984375  svn.dc3.com bob.dc3.com TCP:Flags=...A...F, SrcPort=7822 ...
54 13.984375  bob.dc3.com svn.dc3.com TCP:Flags=...A...., SrcPort=4197 ...

You can see the tunnel and the remote sshd exchanging FIN packets. And here is the tail of the tunnel packet log for the svn ls command:

  00000210  29 20 28 20 37 3a 69 6e 73 74 72 65 67 20 64 69  ) ( 7:instreg di
  00000220  72 20 30 20 66 61 6c 73 65 20 30 20 28 20 32 37  r 0 false 0 ( 27
  00000230  3a 31 39 37 30 2d 30 31 2d 30 31 54 30 30 3a 30  :1970-01-01T00:0
  00000240  30 3a 30 30 2e 30 30 30 30 30 30 5a 20 29 20 28  0:00.000000Z ) (
  00000250  20 29 20 29 20 28 20 37 3a 63 6c 61 70 61 63 6b   ) ) ( 7:clapack
  00000260  20 64 69 72 20 30 20 66 61 6c 73 65 20 30 20 28   dir 0 false 0 (
  00000270  20 32 37 3a 31 39 37 30 2d 30 31 2d 30 31 54 30   27:1970-01-01T0
  00000280  30 3a 30 30 3a 30 30 2e 30 30 30 30 30 30 5a 20  0:00:00.000000Z 
  00000290  29 20 28 20 29 20 29 20 29 20 29 20 29 20        ) ( ) ) ) ) ) 
Outgoing packet type 96 / 0x60 (SSH2_MSG_CHANNEL_EOF)
  00000000  00 00 00 00                                      ....
Event Log: Sent EOF message
Incoming packet type 96 / 0x60 (SSH2_MSG_CHANNEL_EOF)
  00000000  00 00 01 00                                      ....
Incoming packet type 98 / 0x62 (SSH2_MSG_CHANNEL_REQUEST)
  00000000  00 00 01 00 00 00 00 0b 65 78 69 74 2d 73 74 61  ........exit-sta
  00000010  74 75 73 00 00 00 00 00                          tus.....
Event Log: Server sent command exit status 0
Incoming packet type 97 / 0x61 (SSH2_MSG_CHANNEL_CLOSE)
  00000000  00 00 01 00                                      ....
Outgoing packet type 97 / 0x61 (SSH2_MSG_CHANNEL_CLOSE)
  00000000  00 00 00 00                                      ....
Event Log: Disconnected: All channels closed

You can see that the tunnel was allowed to complete its activity and exit through its normal paths. The remote sshd/svnserve processes exit cleanly as well. Problem solved!

RECOMMENDATIONS:

The cleanest and safest way to handle this would seem to be:

1. Modify apr to support the kill mode APR_KILL_AFTER_TIMEOUT *on Windows*. This would cause the tunnel to be killed after three seconds, presumably plenty of time. 

2. Modify libsvn_ra_svn\client.c to use APR_KILL_AFTER_TIMEOUT *on Windows* instead of APR_KILL_ONLY_ONCE.

However, patching apr is probably not acceptable, so instead:

1. Modify libsvn_ra_svn\client.c to use APR_KILL_NEVER *on Windows* instead of APR_KILL_ONLY_ONCE.

What do you think?

  -- Bob Denny

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2407949

Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Stefan Sperling <st...@elego.de>.
On Thu, Oct 15, 2009 at 09:38:23AM -0700, Bob Denny wrote:
> As of SVN 1.6.5, the svn client forcibly kills
> the tunnel subprocess as soon as it receives the last of the data that
> it expects.

This is not entirely correct. We have been killing the ssh client
for a long time.

What's new in 1.6.5 is that we now send SIGTERM instead of SIGKILL,
which allows an SSH process to clean up a master socket it might be
managing for SSH connection pooling.

But it's *not* a new problem in 1.6.5.

> This prevents the tunnel proc from completing its
> conversation with the remote sshd, leaving it and its child svnserve
> haning there for up to a half hour. In my case the tunnel is
> TortoisePLink, but it also happens with the OpenSSH 'ssh.exe' tunnel,
> for the same reason.
> 
> Specifically, in libsvn_ra_svn\client.c, a call is made
> 
> apr_pool_note_subprocess(pool, proc, APR_KILL_ONLY_ONCE);
>
> Inspection of the apr sources reveals that, on Windows, the only
> significant kill modes are APR_KILL_ALWAYS and APR_KILL_NEVER. Any
> other modes (e.g. APR_KILL_ONCE) are translated on Windows to
> APR_KILL_ALWAYS, resulting in the svn client doing an immediate
> TerminateProcess() on the tunnel, preventing it from cleanly closing
> the TCP connection.
> 
> Here is the tail of a packet trace showing the unclean close:
> 
> 43 14.439453 svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
> 44 14.439453 bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4206 ...
> 45 14.529297 svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
> 46 14.529297 bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4206 ...
> 47 14.611328 svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
> 48 14.613281 bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4206 ...
> 49 14.693359 svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
> 50 14.714844 bob.dc3.com svn.dc3.com TCP:Flags=...A.R.., SrcPort=4206 ...
> 
> The 'R' flag means a reset packet. The remote sshd tried to send an
> ACK to me, and my tunnel was killed, so my TCP stack sent back this
> reset packet saying "forget it, this TCP connection is gone". sshd
> goes into bozo mode and leaves its svnserve subprocess running.
> 
> A tail of the tunnel packet log (from an 'svn ls' command) looks like this:
> 
>   00000210  29 20 28 20 37 3a 69 6e 73 74 72 65 67 20 64 69  ) ( 7:instreg di
>   00000220  72 20 30 20 66 61 6c 73 65 20 30 20 28 20 32 37  r 0 false 0 ( 27
>   00000230  3a 31 39 37 30 2d 30 31 2d 30 31 54 30 30 3a 30  :1970-01-01T00:0
>   00000240  30 3a 30 30 2e 30 30 30 30 30 30 5a 20 29 20 28  0:00.000000Z ) (
>   00000250  20 29 20 29 20 28 20 37 3a 63 6c 61 70 61 63 6b   ) ) ( 7:clapack
>   00000260  20 64 69 72 20 30 20 66 61 6c 73 65 20 30 20 28   dir 0 false 0 (
>   00000270  20 32 37 3a 31 39 37 30 2d 30 31 2d 30 31 54 30   27:1970-01-01T0
>   00000280  30 3a 30 30 3a 30 30 2e 30 30 30 30 30 30 5a 20  0:00:00.000000Z 
>   00000290  29 20 28 20 29 20 29 20 29 20 29 20 29 20        ) ( ) ) ) ) ) 
> Outgoing packet type 96 / 0x60 (SSH2_MSG_CHANNEL_EOF)
>   00000000  00 00 00 00                                      ....
> 
> I changed the line of code in client.c to
> 
> apr_pool_note_subprocess(pool, proc, APR_KILL_NEVER);
> 
> and the problem is cured.

But this creates another problem which killing the subprocess is
meant to cure. If long-lived Subversion clients (e.g. IDE integrations)
using the SVN libraries open a connection via svn+ssh:// it's possible
that the ssh subprocesses created never die. Over time, ssh processes
accumulate on the client workstation, doing nothing but waiting to be
killed.

See the comment above the line you're quoting:

  /* Arrange for the tunnel agent to get a SIGTERM on pool
   * cleanup.  This is a little extreme, but the alternatives
   * weren't working out.
   *
   * Closing the pipes and waiting for the process to die
   * was prone to mysterious hangs which are difficult to
   * diagnose (e.g. svnserve dumps core due to unrelated bug;
   * sshd goes into zombie state; ssh connection is never
   * closed; ssh never terminates).
   * See also the long discussion in issue #2580 if you really
   * want to know various reasons for these problems and
   * the different opinions on this issue.
   */

> Here is the tail of a packet trace showing the proper TCP close:
> 
> 42 13.656250  svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
> 43 13.656250  bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4197 ...
> 44 13.734375  svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
> 45 13.734375  bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4197 ...
> 46 13.812500  svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
> 47 13.828125  bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4197 ...
> 48 13.906250  svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
> 49 13.906250  bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4197 ...
> 50 13.906250  bob.dc3.com svn.dc3.com TCP:Flags=...A...F, SrcPort=4197 ...
> 51 13.984375  svn.dc3.com bob.dc3.com TCP:Flags=...A...., SrcPort=7822 ...
> 52 13.984375  svn.dc3.com bob.dc3.com TCP:Flags=...A...., SrcPort=7822 ...
> 53 13.984375  svn.dc3.com bob.dc3.com TCP:Flags=...A...F, SrcPort=7822 ...
> 54 13.984375  bob.dc3.com svn.dc3.com TCP:Flags=...A...., SrcPort=4197 ...
> 
> You can see the tunnel and the remote sshd exchanging FIN packets. And here is the tail of the tunnel packet log for the svn ls command:
> 
>   00000210  29 20 28 20 37 3a 69 6e 73 74 72 65 67 20 64 69  ) ( 7:instreg di
>   00000220  72 20 30 20 66 61 6c 73 65 20 30 20 28 20 32 37  r 0 false 0 ( 27
>   00000230  3a 31 39 37 30 2d 30 31 2d 30 31 54 30 30 3a 30  :1970-01-01T00:0
>   00000240  30 3a 30 30 2e 30 30 30 30 30 30 5a 20 29 20 28  0:00.000000Z ) (
>   00000250  20 29 20 29 20 28 20 37 3a 63 6c 61 70 61 63 6b   ) ) ( 7:clapack
>   00000260  20 64 69 72 20 30 20 66 61 6c 73 65 20 30 20 28   dir 0 false 0 (
>   00000270  20 32 37 3a 31 39 37 30 2d 30 31 2d 30 31 54 30   27:1970-01-01T0
>   00000280  30 3a 30 30 3a 30 30 2e 30 30 30 30 30 30 5a 20  0:00:00.000000Z 
>   00000290  29 20 28 20 29 20 29 20 29 20 29 20 29 20        ) ( ) ) ) ) ) 
> Outgoing packet type 96 / 0x60 (SSH2_MSG_CHANNEL_EOF)
>   00000000  00 00 00 00                                      ....
> Event Log: Sent EOF message
> Incoming packet type 96 / 0x60 (SSH2_MSG_CHANNEL_EOF)
>   00000000  00 00 01 00                                      ....
> Incoming packet type 98 / 0x62 (SSH2_MSG_CHANNEL_REQUEST)
>   00000000  00 00 01 00 00 00 00 0b 65 78 69 74 2d 73 74 61  ........exit-sta
>   00000010  74 75 73 00 00 00 00 00                          tus.....
> Event Log: Server sent command exit status 0
> Incoming packet type 97 / 0x61 (SSH2_MSG_CHANNEL_CLOSE)
>   00000000  00 00 01 00                                      ....
> Outgoing packet type 97 / 0x61 (SSH2_MSG_CHANNEL_CLOSE)
>   00000000  00 00 00 00                                      ....
> Event Log: Disconnected: All channels closed
> 
> You can see that the tunnel was allowed to complete its activity and
> exit through its normal paths. The remote sshd/svnserve processes exit
> cleanly as well. Problem solved!
> 
> RECOMMENDATIONS:
> 
> The cleanest and safest way to handle this would seem to be:
> 
> 1. Modify apr to support the kill mode APR_KILL_AFTER_TIMEOUT *on
> Windows*. This would cause the tunnel to be killed after three
> seconds, presumably plenty of time. 
> 
> 2. Modify libsvn_ra_svn\client.c to use APR_KILL_AFTER_TIMEOUT *on
> Windows* instead of APR_KILL_ONCE.
> 
> However, patching apr is probably not acceptable, so instead:

Patching APR is certainly acceptable.
Would you be willing to work on a patch of APR and submit it there?

> 1. Modify libsvn_ra_svn\client.c to use APR_KILL_NEVER *on Windows*
> instead of APR_KILL_ONCE.
> 
> What do you think?

KILL_NEVER is not a good solution. If it was, it would already be done
this way, and we'd never have had to mess around with killing SSH
processes in the first place.

Stefan

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2408224

RE: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Bob Denny <rd...@dc3.com>.
Hi Paul --

> [...] it should only take a few minutes for someone to try
> running SSHD + SVNSERVE with SSHD in debug mode and see what the log
> messages say about the handshake with SVNSERVE under the circumstance you
> describe with the RST packet, especially since there are other
> router/network scenarios which would leave a zombie SVNSERVE running.

It might prove interesting, but my guess is that svnserve is happily "awaiting further instructions" from the command-line svn (or whatever client) at the other end. Its parent, the server-side sshd tunnel, is sitting there saying "What happened? The SSH connection is still open but I'm getting errors from sockets!" What _should_ it do? That's the question. Should it just kill itself and its child svnserve, or should it also wait a while to receive something from the other end ... like maybe a proper SSH closing handshake? Not clear. Right now it's waiting. It can easily be argued that this is what it should do.

The bottom line is that instantly killing the SSH tunnel at the client sets off a chain of events, all of which go through error recovery paths FOR EVERY SINGLE WORK PACKAGE BETWEEN SVN AND SVNSERVE. My points are (1) that we should not be routinely running through error paths, expecting them to act like something that was intended for production, and (2) having apr instantly kill the tunnel causes (1).

 -- Bob

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2410429

RE: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Paul Charlton <te...@byiq.com>.
Bob,
I see your point entirely.  Reverting to legacy behavior is the path of
least resistance and is usually the most compatible with what is already in
the field.

That being said, it should only take a few minutes for someone to try
running SSHD + SVNSERVE with SSHD in debug mode and see what the log
messages say about the handshake with SVNSERVE under the circumstance you
describe with the RST packet, especially since there are other
router/network scenarios which would leave a zombie SVNSERVE running.  At a
minimum, get a bug filed on the server side zombie behavior -- it will most
likely be something out-of-band like an ignored signal or skipped return
code from "read" on the stdin from the parent SSHD process.

Best regards,
Paul

> -----Original Message-----
> From: Bob Denny [mailto:rdenny@dc3.com]
> Sent: Thursday, October 22, 2009 5:47 AM
> To: dev@subversion.tigris.org
> Subject: RE: Issue #2580 revisited: Windows unclean TCP close [SEVERE]
> 
> Paul --
> 
> > I did read this entire thread ... the problem is the linux sshd/child
> > process, per http://tools.ietf.org/html/rfc1122#page-87, see section
> > 4.2.2.13.
> > Just for kicks, also see
> >
> http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Connection_t
> ermin
> > ation
> 
> I'm well aware of these issues, particularly the TCP connection
> termination process. I may not have been clear enough before: by
> instantly killing the tunnel process, the proper termination of the TCP
> connection is prevented (not to mention the proper termination of the
> SVN protocol!). The Ethernet traces I posted show the node where the
> tunnel USED to be running sending a TCP RESET packet. The TCP
> connection was IMPROPERLY terminated due to the tunnel (the client end
> of the TCP connection) vaporizing before the process could run to clean
> completion.
> 
> > Under this situation, I would not advocate changing the windows
> > implementation to "correct" the mis-behavior of the linux sshd and
> its child
> > process.
> 
> Do I understand you to be saying that it's sshd's error? How can it
> tell the difference between "dead partner" and "temporarily slow or
> dead net connection"? All that TCP stuff is running a couple of layers
> down under the sockets API that sshd uses. So sshd properly sits there
> for "a while", with its child svnserve also sitting there, both waiting
> for some word from the other end. Eventually (10+ MINUTES later) sshd
> DOES exit and terminate its child. But this process is one of "forget
> it, something's really wrong", I'm outa here...
> 
> Instantly killing the tunnel (clearly the WRONG way to terminate a TCP
> connection!) sets off a chain of events, ALL OF WHICH ARE ERRORS. The
> svn protocol does not complete normally, nor does the ssh protocol, nor
> does the TCP protocol! Do we want EVERY SINGLE svn+ssh connection to
> complete through error paths? That's what's happening now, to EVERYONE
> who is using svn+ssh from a Windows client! I've already explained why
> we haven't gotten more complaints.
> 
> But I repeat: instantly killing the tunnel upon receipt of the last SVN
> data from a remote svnserve is NOT the way to shut down the connection
> between the two. It leaves the remote sshd wondering what happened and
> sitting there waiting for the next step in ITS protocol (SSH). It has
> no idea what is coming next (more data or what???). So it sits there
> waiting...
> 
> The recent change to use APR_KILL_ONCE is (imo) the right way to do it,
> but it works (sending SIGHUP to the tunnel) only on Linux, etc. I
> suspect the authors didn't realize that APR on Windows converts
> KILL_ONCE to KILL_ALWAYS and instantly kills the child (unlike the real
> KILL_ONCE on Linux, which sends SIGHUP to the tunnel, then waits for 3
> seconds to allow things to wind down properly, then kills it if it
> hasn't already exited normally).
> 
> > Also, to quote your original post -- "sshd goes into bozo mode and
> leaves
> > its svnserve subprocess running."  I haven't looked into the sources,
> but if
> > I recall correctly, sshd under this circumstance is sending "SIGHUP"
> to the
> > child process (svnserve), giving it the opportunity to flush buffers
> ...
> > which would mean that svnserve is not responding correctly to
> external
> > signals.
> 
> sshd eventually does that. But only after a long time, for the reasons
> given above.
> 
>   -- Bob
> 
> ------------------------------------------------------
> http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageI
> d=2410204

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2410241

RE: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Bob Denny <rd...@dc3.com>.
Paul --

> I did read this entire thread ... the problem is the linux sshd/child
> process, per http://tools.ietf.org/html/rfc1122#page-87, see section
> 4.2.2.13.
> Just for kicks, also see
> http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Connection_termin
> ation

I'm well aware of these issues, particularly the TCP connection termination process. I may not have been clear enough before: by instantly killing the tunnel process, the proper termination of the TCP connection is prevented (not to mention the proper termination of the SVN protocol!). The Ethernet traces I posted show the node where the tunnel USED to be running sending a TCP RESET packet. The TCP connection was IMPROPERLY terminated due to the tunnel (the client end of the TCP connection) vaporizing before the process could run to clean completion.

> Under this situation, I would not advocate changing the windows
> implementation to "correct" the mis-behavior of the linux sshd and its child
> process.

Do I understand you to be saying that it's sshd's error? How can it tell the difference between "dead partner" and "temporarily slow or dead net connection"? All that TCP stuff is running a couple of layers down under the sockets API that sshd uses. So sshd properly sits there for "a while", with its child svnserve also sitting there, both waiting for some word from the other end. Eventually (10+ MINUTES later) sshd DOES exit and terminate its child. But this process is one of "forget it, something's really wrong", I'm outa here...

Instantly killing the tunnel (clearly the WRONG way to terminate a TCP connection!) sets off a chain of events, ALL OF WHICH ARE ERRORS. The svn protocol does not complete normally, nor does the ssh protocol, nor does the TCP protocol! Do we want EVERY SINGLE svn+ssh connection to complete through error paths? That's what's happening now, to EVERYONE who is using svn+ssh from a Windows client! I've already explained why we haven't gotten more complaints.

But I repeat: instantly killing the tunnel upon receipt of the last SVN data from a remote svnserve is NOT the way to shut down the connection between the two. It leaves the remote sshd wondering what happened and sitting there waiting for the next step in ITS protocol (SSH). It has no idea what is coming next (more data or what???). So it sits there waiting...

The recent change to use APR_KILL_ONLY_ONCE is (imo) the right way to do it, but it works (sending SIGTERM to the tunnel) only on Linux, etc. I suspect the authors didn't realize that APR on Windows converts KILL_ONLY_ONCE to KILL_ALWAYS and instantly kills the child (unlike the real behavior on Linux, where the tunnel is sent SIGTERM and given a chance to wind down properly before any hard kill).

> Also, to quote your original post -- "sshd goes into bozo mode and leaves
> its svnserve subprocess running."  I haven't looked into the sources, but if
> I recall correctly, sshd under this circumstance is sending "SIGHUP" to the
> child process (svnserve), giving it the opportunity to flush buffers ...
> which would mean that svnserve is not responding correctly to external
> signals.

sshd eventually does that. But only after a long time, for the reasons given above.  

  -- Bob

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2410204

RE: Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Bob Denny <rd...@dc3.com>.
> (I'm not going to commit the patch myself simply because I cannot test
> it since I don't have a Windows machine. And I'd rather see the effects
> of what I commit with my own eyes.)

I hope someone will validate it and either apply the patch or come up with a "better" way to solve the problem. As it is now, (1.6.6) svn cannot be used via svn+ssh (and probably other tunnels) under the conditions I outlined.

> The default action for SIGHUP is terminating the process.
> I don't see anything in the svnserve source code overriding SIGHUP.

The problem is that sshd is waiting for ITS protocol (SSH) to complete before things wind down. Maybe the other end (the client) has more data to send? Since the client didn't initiate the SSH closing phase (due to it having been instantly killed) things at the server are in an indeterminate state, and eventually (many minutes later) the chain of processes on the server exit through error paths.

  -- Bob

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2410205

RE: Re: Re: Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Bob Denny <rd...@dc3.com>.
Daniel:
> * Without the patch, the TCP SSH connection closes with RST.
> * With the patch, it closes with double-ended FIN.

If you could see the payload data (instead of the SSH-encrypted stream) you would see that the SVN protocol between svn and svnserve also completes normally. You could also see the SSH protocol itself completing normally (though it's difficult to tell due to the encryption). I used TortoisePLink's logging capability - it logs the cleartext data going into and coming out of the tunnel - to see the SVN protocol complete normally (EOF exchange). This is part of my initial message in this thread.

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2411122

Re: Re: Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Stefan Sperling wrote on Sat, 24 Oct 2009 at 21:40 +0200:
> I'll ask Bert, Paul and Daniel to take a look at this thread.
> If they don't have any concerns I'd say let's just commit your patch
> and hope that this stupid issue will finally die...
> 

Okay, with help from Stefan, took the patch for a quick drive.

My observations:

* Without the patch, the TCP SSH connection closes with RST.

* With the patch, it closes with double-ended FIN.

* I don't see any leftover processes on the windows client side.

* I tested both with svn+ssh:// and with the two svn+custom:// schemes
  that I use regularly.

* I see leftover processes on the server side only when the script that
  my ssh tunnel runs (on the server) doesn't pass "-q 0" to /bin/nc.
  (On Ubuntu that's the "quit N seconds after EOF on stdin" flag.)

* For testing I used the svn client (fresh trunk) and putty/plink 0.60.
  I didn't test connection pooling because I didn't find the putty
  equivalent of openssh's ControlMaster option.

Bottom line: +0 because FIN is better than RST, I don't see any
ill effects, and there is no previous report of zombies on windows
(per Stefan's last mail).

Daniel

> Stefan
> 
> ------------------------------------------------------
> http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2411087
>

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2411095

Re: Re: Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Stefan Sperling <st...@elego.de>.
On Sat, Oct 24, 2009 at 10:52:55AM -0700, Bob Denny wrote:
> > I'd rather be careful when changing the default to something that
> > is known to have caused problems in the past.
> 
> Understood! But was it known to cause problems on Windows?

I don't know.

But it turns out that this has been mentioned before:
"The zombie problem is specific to unix-like systems and
does not affect windows."
http://subversion.tigris.org/issues/show_bug.cgi?id=2580#desc21

See also this version of Kyle McKay's issue #2580 patch:
http://subversion.tigris.org/nonav/issues/showattachment.cgi/993/client.c-patch_v2
It explicitly prevents the signal being sent on win32.

> What we want is for the client svn and server svnserve to be able to
> complete their work with an EOF exchange at the SVN protocol level.
> This will allow (cause) the remote svnserve to exit normally. Then
> have the client close the fd, (an EOF to the SSH tunnel) which will
> allow (cause) the client and server SSH tunnel pair to gracefully
> complete the SSH closing exchange, at which time both the local tunnel
> agent and the remote sshd will exit normally. Finally, the client
> exits normally. This is how it is designed to work. I spent a day
> reading code and protocol docs to learn about it. 

I'll ask Bert, Paul and Daniel to take a look at this thread.
If they don't have any concerns I'd say let's just commit your patch
and hope that this stupid issue will finally die...

Stefan

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2411087

RE: Re: Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Bob Denny <rd...@dc3.com>.
Stefan --

I apologize for my OS-related comments. I understand, and will not mention such things in the future. 

> Are you sure? OpenSSH supports it. As far as I know people can use 
> [connection pooling] on Windows e.g. with Cygwin.

I think that virtually all Windows people are using a client such as one of those I mentioned or svn.exe. I can see some setup with Windows as a tunneled server using Cygwin and OpenSSH, but I think this is a rarity, and my patch doesn't affect the server side anyway.

> But from reading the APR code it seems there is no difference at all
> between APR_KILL_ONLY_ONCE and APR_KILL_ALWAYS on windows. 

Correct. I also noted this in earlier messages.

> So the problem you are trying to fix should have existed pre-1.6.5. 
> Can you confirm this, if only to help me make sure I've understood 
> the problem?

It would be difficult, as I would have to check out and build an earlier version. I will do it if you really think it will help. Building Subversion on Windows is really tricky: I spent two days trying with the instructions provided in the Subversion tree and failed. I am able to build it, however, with the build tools and scripts provided in the TortoiseSVN tree in just half an hour with no problems (apart from tons of compiler warnings about signed/unsigned comparisons and conversions from size_t to int and __int64 to int). The results pass all of the tests. But it is not a build from the Subversion repo, so I don't know if that's good enough.

I have no doubt that the problem exists in earlier versions though, if APR_KILL_ALWAYS is being done.

> OK, I believe that, and that's quite an amount of client coverage
> in your testing which is very good. Any problems at the server's end?

None.

> I'd rather be careful when changing the default to something that
> is known to have caused problems in the past.

Understood! But was it known to cause problems on Windows?

> What I'd like to be 100% certain about is that not killing the 
> [tunnel] will not cause any problems on the server side.

It has not in my testing. 

In fact, killing the tunnel ALWAYS causes problems on the server side on my setup, with a fast client CPU and a modest (public) Internet connection to the server. My first message, with the Ethernet traces and the Tortoise logs, shows that the SVN protocol never gets a chance to close (EOF exchange), so the remote svnserve is left wondering what's next, and the SSH protocol never gets a chance to close, leaving the remote sshd wondering what's next. Eventually one of them gives up and exits through its error path, but it takes a long time.

As a reminder, with my slower laptop system, it takes longer to actually kill the tunnel after closing the fd. So the tunnel gets a chance to sneak in the SSH closing exchange. Then the remote sshd exits normally, terminating its child svnserve, which is still waiting for something else because it didn't get the svn-protocol EOF exchange, but at least the remote processes die. Even this scenario is unclean. But I suspect that is what's happening to others who are not having the problem I have. Another factor that may prevent the problem is a fast SSH connection (LAN) where it takes less time to complete the SSH closing exchange. 

What we want is for the client svn and server svnserve to be able to complete their work with an EOF exchange at the SVN protocol level. This will allow (cause) the remote svnserve to exit normally. Then have the client close the fd, (an EOF to the SSH tunnel) which will allow (cause) the client and server SSH tunnel pair to gracefully complete the SSH closing exchange, at which time both the local tunnel agent and the remote sshd will exit normally. Finally, the client exits normally. This is how it is designed to work. I spent a day reading code and protocol docs to learn about it. 

  -- Bob

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2411073

Re: Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Stefan Sperling <st...@elego.de>.
On Fri, Oct 23, 2009 at 07:59:23PM -0700, Bob Denny wrote:
> Stefan --
> 
> I just found this up in the thread tree, and I apologize for not
> replying sooner. This discussion thread has become confusing. I wonder
> if you read my replies to Paul, in which I lay it all out in clear
> terms. But let me respond to your response to Paul:
> 
> > The problem is:
> >
> > We need to terminate the tunnel agent (ssh client) by sending sigterm.
> 
> Not on windows. There is no ssh connection pooling.

Are you sure? OpenSSH supports it. As far as I know people can use it
on Windows e.g. with Cygwin.

> > By the way, Bob, maybe you are running a version of sshd affected by that
> > bug? If so, could you try updating sshd and see if that solves the problem,
> > without your patch applied?
> 
> I don't know. The sshd is on a remote service (A2 Hosting). And I
> don't care, because the problem is the harsh killing of the
> local/client tunnel agent (e.g. PLink). That's what starts it all.

If you are testing this patch against a remote hosting service, how can
you be sure that nothing bad happens on the server side without killing
the client?

I can live with the idea that simply closing the file descriptor should
terminate the ssh client, and it in turn should close the TCP socket.
The whole machinery should exit gracefully if this is done. On any OS.

But the fact that we have been sending APR_KILL_ALWAYS to the ssh
client *for years* seems to indicate that there is a problem with
not sending this signal. Which problem it might be is uncertain;
all the information I have is this comment:

   * Closing the pipes and waiting for the process to die
   * was prone to mysterious hangs which are difficult to
   * diagnose (e.g. svnserve dumps core due to unrelated bug;
   * sshd goes into zombie state; ssh connection is never
   * closed; ssh never terminates).

I'd rather be careful when changing the default to something that
is known to have caused problems in the past.

But who knows, maybe the problems people were seeing back then were
phantoms and we can switch to APR_KILL_NEVER on any platform without
causing trouble.

> Did you read this:
> 
> http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2410204
> 
> If you instantly kill the tunnel agent, EVERYTHING runs through error
> paths, and depending on timing (client CPU and network latency) the
> remote sshd and its svnserve child can be left hanging.

Yes, that's bad. And we don't want to instantly kill the tunnel agent.
We don't do it on UNIX anymore (since 1.6.5). When I made the change I
was unaware of the fact that the situation on windows didn't change
because APR does not (or cannot) handle SIGTERM on windows.
But from reading the APR code it seems there is no difference at all
between APR_KILL_ONLY_ONCE and APR_KILL_ALWAYS on windows. So the
problem you are trying to fix should have existed pre-1.6.5. Can you
confirm this, if only to help me make sure I've understood the problem?

It seems we have no other choice on windows than to just close the fd
and hope for the best. What I'd like to be 100% certain about is that
not killing the client will not cause any problems on the server side.
If it does, we need to consider this too and maybe amend or extend your
solution. The comment above hints at server-side problems not killing
the client might cause ("sshd goes into zombie state").

Since I cannot reproduce any of this and have very little experience
with windows in general I cannot make an informed decision.
Rather instead of me second-guessing what's best for windows, I'd like
one of our Windows developers (e.g. Bert or Paul) to take a look at
this problem.

> I'm reaching the end of my limits trying to explain a problem from the
> standpoint of a Windows developer to this group who are clearly rooted
> in the Linux world.

Please stop saying such things. It is not the reason why digesting your
patch takes a long time. The problem is complicated and as it stands we
need more information to make an informed decision, that's all.

Repeatedly telling people in the open source community that they are intolerant
of your choice of OS is bad form, and it will cause exactly the divisive effect
you are trying to avoid, not because people disagree with your choice of OS.

> I can see absolutely no reason to "terminate the tunnel agent" on
> Windows. At least not the tunnel agents PLink.exe, TortoisePLink.exe,
> and an ssh.exe I got from "somewhere". All exit gracefully when used
> (at least) from svn.exe, from TortoiseSVN, from SVN for Dreamweaver
> (DW GUI plugin), and from VisualSVN (a Visual Studio GUI plugin).
> Furthermore, they run as children of the SVN program or GUI plugin
> anyway, and if that exits, Windows takes out all children. 

OK, I believe that, and that's quite an amount of client coverage
in your testing which is very good. Any problems at the server's end?

> As it is now, subversion 1.6.6 is unusable by me.

That's bad and we need it fixed ASAP.

> My patched version
> is running nicely, no problems with my tunnels.

That's a good indicator that you're going in the right direction.

Stefan

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2411017

RE: Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Bob Denny <rd...@dc3.com>.
Stefan --

I just found this up in the thread tree, and I apologize for not replying sooner. This discussion thread has become confusing. I wonder if you read my replies to Paul, in which I lay it all out in clear terms. But let me respond to your response to Paul:

> The problem is:
>
> We need to terminate the tunnel agent (ssh client) by sending sigterm.

Not on windows. There is no ssh connection pooling. There is nothing complicated. The tunnel is a simple subprocess. There is no fork/exec on Windows. 

> By the way, Bob, maybe you are running a version of sshd affected by that
> bug? If so, could you try updating sshd and see if that solves the problem,
> without your patch applied?

I don't know. The sshd is on a remote service (A2 Hosting). And I don't care, because the problem is the harsh killing of the local/client tunnel agent (e.g. PLink). That's what starts it all.

Did you read this:

http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2410204

If you instantly kill the tunnel agent, EVERYTHING runs through error paths, and depending on timing (client CPU and network latency) the remote sshd and its svnserve child can be left hanging.

I'm reaching the end of my limits trying to explain a problem from the standpoint of a Windows developer to this group, who are clearly rooted in the Linux world. I have nothing against Linux; I run several flavors here and do some development on it, as well as supporting some astronomy customers who use it. If you go back and read my posts, you should understand what is going on. It is not my opinion; I started out here with facts (see my first post for an example).

I can see absolutely no reason to "terminate the tunnel agent" on Windows. At least not the tunnel agents PLink.exe, TortoisePLink.exe, and an ssh.exe I got from "somewhere". All exit gracefully when used (at least) from svn.exe, from TortoiseSVN, from SVN for Dreamweaver (DW GUI plugin), and from VisualSVN (a Visual Studio GUI plugin). Furthermore, they run as children of the SVN program or GUI plugin anyway, and if that exits, Windows takes out all children. 

We can argue about theory, but nowhere do I see the need to forcibly/instantly terminate the tunnel on Windows. And if we DO terminate it, a chain of events takes place that I have amply explained in other messages here, all of which are ugly error paths. As it is now, subversion 1.6.6 is unusable by me. My patched version is running nicely, no problems with my tunnels.

With this message I've risked being annoying or worse, so I'm going to leave this issue for you all to do whatever you decide. I want to thank you all for subversion, it is a fantastic tool. I'm sorry I couldn't contribute in my small way from my small corner of the world.

  -- Bob

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2410821

RE: off mailing list : Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Bob Denny <rd...@dc3.com>.
Paul --

Very enlightening! Thanks for the clear explanation of the design issue. I'm unfamiliar with UML modeling (a hole in my capabilities for sure), yet I understand what you're saying, at least at a conceptual level.

I still stand by my pragmatic approach to the problem for a Windows client. In practice, the wind-down proceeds properly and no processes are left running at either end.

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2411192

RE: off mailing list : Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Paul Charlton <te...@byiq.com>.
Bob ...

Therein lies the fundamental problem:
At an application layer of the OSI model, SVNSERVE is expecting "socket"
semantics (using one system file descriptor), whereas when it is using SSHD,
it is actually using "terminal client" semantics (with two system file
descriptors) ... the two semantics are not quite the same when it comes to
control flows or object lifetime management.

On the other hand, if the SSH were used to encapsulate a formal socket (or
even "pipe") in a VPN style tunnel, then there would be no issue.

This whole question is a fundamental design problem, not a band-aid
situation.  It becomes really clear once you try to start some UML sequence
modeling of the objects being used on each end of the SVNCLIENT to SVNSERVE
session.

-Paul

PS: But this still raises the pragmatic issue of the cost of implementing a
corner case vs. the cost of support, given the probability of occurrence.

> -----Original Message-----
> From: Stefan Sperling [mailto:stsp@elego.de]
> Sent: Sunday, October 25, 2009 4:36 AM
> To: Bob Denny
> Cc: Paul Charlton
> Subject: Re: off mailing list : Issue #2580 revisited: Windows unclean
> TCP close [SEVERE]
> 
> On Sat, Oct 24, 2009 at 11:26:32PM -0700, Bob Denny wrote:
> > Paul and Stefan --
> >
> > Paul, I appreciate your comments and understand them. But both of you
> seem to be
> > (at least as I read it) looking at the situation from the bottom up
> (from the
> > TCP viewpoint). Look from the top down - from the SVN protocol:
> >
> > Let's take a simple example an svn client and an svnserve server
> connected by a
> > wet noodle. Over this noodle flows the svn protocol, as defined here
> >
> >
> http://svn.collab.net/repos/svn/trunk/subversion/libsvn_ra_svn/protocol
> >
> > How does the svnserve know when the svn client is finished with it
> (and that it
> > should exit)? There is noting in the svn protocol for this, so the
> server
> > depends on getting something like an EOF over the noodle.
> >
> > Replace the noodle with an SSH tunnel. svn and svnserve could care
> less that it
> > is SSH. The svn client closes its connection, the EOF flows through
> the tunnel,
> > pops out the other end, the svnserve sees it and knows it's time to
> exit.
> >
> > Now SSH _does_ have a closing protocol! It consists of:
> >
> > 1. The client end sends SSH2_MSG_CHANNEL_EOF ("I'm finished")
> > 2. The server end replies in kind with SSH2_MSG_CHANNEL_EOF (My
> svnserve exited")
> > 3. The server end follows with SSH2_MSG_CHANNEL_REQUEST (exit-status)
> > containing the exit status of the svnserve.
> > 4. The client end sends SSH2_MSG_CHANNEL_CLOSE
> > 5. The server end sends SSH2_MSG_CHANNEL_CLOSE
> >
> > FINALLY, at this point, each end is responsible for closing down the
> TCP
> > connection. Only now does the FIN/FIN exchange take place. Normally
> the client
> > end, the end that initiated the TCP connection, should close that
> connection by
> > sending the first FIN.
> >
> > Now let's look at what happens when the client svn kills its tunnel
> agent as
> > soon as it receives the last of the data it is expecting from the
> svnserve
> > server: None of the 5 steps above ever happen! The SSH layer client-
> server
> > protocol is instantly stopped, leaving the remote sshd wondering what
> the h*ll
> > happened, and its child svnserve waiting for something else to do.
> How long
> > should they wait? Should the sshd give up immediately, kill its child
> svnserve,
> > and exit? It could be argued that it should wait for another incoming
> connection
> > so they can do more work.
> >
> > The server end's TCP stack may be in the middle of ACKing some past
> packet and
> > that will be met at the client with a RST because there's no more
> socket, it's
> > owner was killed. But that's basically irrelevant. The real problem
> is that the
> > SSH protocol never got a chance to relay the fact that the client svn
> was done,
> > letting the svnserve gracefully exit, and the two tunnel agents never
> got a
> > chance to wind down their protocol either.
> 
> That's a nice description of how it should work in theory.
> It would be nice and simple if closing the pipe to the ssh
> client was enough for things to settle down.
> In practice, not killing the tunnel agent didn't work out,
> at least when using svn+ssh and openssh connection pooling together.
> You end up with lots of ssh client processes idling about on the
> client machine (on UNIX).
> See http://subversion.tigris.org/issues/show_bug.cgi?id=2580#desc13
> 
> Stefan

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2411166

Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Stefan Sperling <st...@elego.de>.
On Thu, Oct 22, 2009 at 06:52:11AM -0700, Paul Charlton wrote:
> Fix SVNSERVE to respond correctly to the handshake it is receiving from
> sshd. I am not currently set up to do the test, but anyone who is running
> SSHD+SVNSERVE could run their SSHD in DEBUG mode and see what the handshake
> is and where SVNSERVE is dropping the ball.

What handshake are you referring to?
svnserve is completely agnostic about the ssh protocol. To it, ssh is
just a tunnel agent: something that gives it a file descriptor to read
and write data. It assumes it can send a SIGTERM to the tunnel agent
to tell it to quit the connection.

> > If Bob's patch fixes the problem without causing undesired side-
> > effects,
> > then that's fine by me.
> 
> This problem with SVNSERVE can also manifest itself without the windows
> client, and Bob's changes will not fix the other sources of the problem (for
> example, when someone kills the client from the Windows Task Manager).
> Fixing SVNSERVE will eliminate the whole class of problems.
> 
> More specifically, there are routers and NAT devices out there which also
> can end the session with a TCP RST to the server (instead of FIN/ACK), I am
> pretty sure my old NetGear NAT would do that even under legitimate
> circumstance, and the RFC does permit it.

Yes, connections can be severed for many reasons, but that's beside
the point here.

The problem is:

We need to terminate the tunnel agent (ssh client) by sending sigterm.
We cannot use sigkill because that does not work with ssh connection
pooling. Any ssh connection pooling master we kill with sigterm will
leave behind debris that prevents a new master from being spawned.
See issue #2580.

But sending a sigterm does not work on windows (either because Windows
or APR doesn't support it, I don't understand which). So Bob is
proposing we send sigkill again on windows only, possibly re-creating
the ssh connection pooling problem for windows users.

Note also that there is a known problem with OpenSSH versions < 5.1,
which can manifest itself in hanging servers/clients. Again, see issue
#2580 for details.
By the way, Bob, maybe you are running a version of sshd affected by that
bug? If so, could you try updating sshd and see if that solves the problem,
without your patch applied?

Also, I'd like to say this:
Everyone, please note that "how to terminate svn+ssh" is an issue that
keeps coming back haunting us. It is quite complicated to get this right,
and there are many components to point fingers at when looking at related
problems. Please refrain from speculating about what the cause might be;
look, and present data to prove your claims. Anything else is speculation,
and causes long discussions with no useful outcome. It does not help much,
or worse, it might mislead people trying to fix the problem.

Thanks,
Stefan

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2410451

RE: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Paul Charlton <te...@byiq.com>.
> -----Original Message-----
> From: Stefan Sperling [mailto:stsp@elego.de]
> Sent: Thursday, October 22, 2009 1:51 AM
> To: Paul Charlton
> Cc: 'Bob Denny'; dev@subversion.tigris.org
> Subject: Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]
> 
> On Wed, Oct 21, 2009 at 10:33:32PM -0700, Paul Charlton wrote:
> > I did read this entire thread ... the problem is the linux sshd/child
> > process, per http://tools.ietf.org/html/rfc1122#page-87, see section
> > 4.2.2.13.
> > Just for kicks, also see
> >
> http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Connection_t
> ermin
> > ation
> >
> > Under this situation, I would not advocate changing the windows
> > implementation to "correct" the mis-behavior of the linux sshd and
> its child
> > process.
> 
> What would you suggest doing instead?

Fix SVNSERVE to respond correctly to the handshake it is receiving from
sshd.  I am not currently set up to do the test, but anyone who is running
SSHD+SVNSERVE could run their SSHD in DEBUG mode and see what the handshake
is and where SVNSERVE is dropping the ball.

> 
> If Bob's patch fixes the problem without causing undesired side-
> effects,
> then that's fine by me.

This problem with SVNSERVE can also manifest itself without the windows
client, and Bob's changes will not fix the other sources of the problem (for
example, when someone kills the client from the Windows Task Manager).
Fixing SVNSERVE will eliminate the whole class of problems.

More specifically, there are routers and NAT devices out there which also
can end the session with a TCP RST to the server (instead of FIN/ACK), I am
pretty sure my old NetGear NAT would do that even under legitimate
circumstances, and the RFC does permit it.

> 
> (I'm not going to commit the patch myself simply because I cannot test
> it since I don't have a Windows machine. And I'd rather see the effects
> of what I commit with my own eyes.)
> 
> > Also, to quote your original post -- "sshd goes into bozo mode and
> leaves
> > its svnserve subprocess running."  I haven't looked into the sources,
> but if
> > I recall correctly, sshd under this circumstance is sending "SIGHUP"
> to the
> > child process (svnserve), giving it the opportunity to flush buffers
> ...
> > which would mean that svnserve is not responding correctly to
> external
> > signals.
> 
> The default action for SIGHUP is terminating the process.
> I don't see anything in the svnserve source code overriding SIGHUP.
> 
> Stefan

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2410229

Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Stefan Sperling <st...@elego.de>.
On Wed, Oct 21, 2009 at 10:33:32PM -0700, Paul Charlton wrote:
> I did read this entire thread ... the problem is the linux sshd/child
> process, per http://tools.ietf.org/html/rfc1122#page-87, see section
> 4.2.2.13.
> Just for kicks, also see
> http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Connection_termin
> ation
> 
> Under this situation, I would not advocate changing the windows
> implementation to "correct" the mis-behavior of the linux sshd and its child
> process.

What would you suggest doing instead?

If Bob's patch fixes the problem without causing undesired side-effects,
then that's fine by me.

(I'm not going to commit the patch myself simply because I cannot test
it since I don't have a Windows machine. And I'd rather see the effects
of what I commit with my own eyes.)

> Also, to quote your original post -- "sshd goes into bozo mode and leaves
> its svnserve subprocess running."  I haven't looked into the sources, but if
> I recall correctly, sshd under this circumstance is sending "SIGHUP" to the
> child process (svnserve), giving it the opportunity to flush buffers ...
> which would mean that svnserve is not responding correctly to external
> signals.

The default action for SIGHUP is terminating the process.
I don't see anything in the svnserve source code overriding SIGHUP.

Stefan

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2410122

RE: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Paul Charlton <te...@byiq.com>.
I did read this entire thread ... the problem is the linux sshd/child
process, per http://tools.ietf.org/html/rfc1122#page-87, see section
4.2.2.13.
Just for kicks, also see
http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Connection_termin
ation

Under this situation, I would not advocate changing the windows
implementation to "correct" the mis-behavior of the linux sshd and its child
process.

Also, to quote your original post -- "sshd goes into bozo mode and leaves
its svnserve subprocess running."  I haven't looked into the sources, but if
I recall correctly, sshd under this circumstance is sending "SIGHUP" to the
child process (svnserve), giving it the opportunity to flush buffers ...
which would mean that svnserve is not responding correctly to external
signals.

-Paul

-----Original Message-----
From: Bob Denny [mailto:rdenny@dc3.com] 
Sent: Thursday, October 15, 2009 9:38 AM
To: dev@subversion.tigris.org
Subject: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

SVN 1.6.5 on Windows and using svn+ssh fails to properly complete and close
the TCP connection, leaving the remote sshd/svnserve processes running
(CentOS Linux server). These last "forever", accumulating until the sysop
disables your SSH access :-) This is a severe problem. It doesn't always
appear on slower CPUs with fast connections to the remote svn+ssh repo.
However, my system is a 2.6GHz quadcore, and the remote repo is across the
public Internet. This is the worst case environment as you'll see. I am
certain this problem is not unique to me.

I'm looking for a "buddy" on this. I am a TortoiseSVN user. TortoiseSVN
shares much code with Subversion. I have checked out the latest TortoiseSVN
trunk (which includes Subversion 1.6.5 as an external) and built it. I hope
I won't have to go through this again for the Subversion sources in order to
achieve the credibility I need here...

It is definitely an issue with Subversion, specifically libsvn_ra_svn and
its usage of apr. As of SVN 1.6.5, the svn client forcibly kills the tunnel
subprocess as soon as it receives the last of the data that it expects. This
prevents the tunnel proc from completing its conversation with the remote
sshd, leaving it and its child svnserve hanging there for up to a half hour.
In my case the tunnel is TortoisePLink, but it also happens with the OpenSSH
'ssh.exe' tunnel, for the same reason.

Specifically, in libsvn_ra_svn\client.c, a call is made

apr_pool_note_subprocess(pool, proc, APR_KILL_ONLY_ONCE);

Inspection of the apr sources reveals that, on Windows, the only significant
kill modes are APR_KILL_ALWAYS and APR_KILL_NEVER. Any other modes (e.g.
APR_KILL_ONLY_ONCE) are translated on Windows to APR_KILL_ALWAYS, resulting in
the svn client doing an immediate TerminateProcess() on the tunnel,
preventing it from cleanly closing the TCP connection.

Here is the tail of a packet trace showing the unclean close:

43 14.439453 svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
44 14.439453 bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4206 ...
45 14.529297 svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
46 14.529297 bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4206 ...
47 14.611328 svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
48 14.613281 bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4206 ...
49 14.693359 svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
50 14.714844 bob.dc3.com svn.dc3.com TCP:Flags=...A.R.., SrcPort=4206 ...

The 'R' flag means a reset packet. The remote sshd tried to send an ACK to
me, and my tunnel was killed, so my TCP stack sent back this reset packet
saying "forget it, this TCP connection is gone". sshd goes into bozo mode
and leaves its svnserve subprocess running.

A tail of the tunnel packet log (from an 'svn ls' command) looks like this:

  00000210  29 20 28 20 37 3a 69 6e 73 74 72 65 67 20 64 69  ) ( 7:instreg di
  00000220  72 20 30 20 66 61 6c 73 65 20 30 20 28 20 32 37  r 0 false 0 ( 27
  00000230  3a 31 39 37 30 2d 30 31 2d 30 31 54 30 30 3a 30  :1970-01-01T00:0
  00000240  30 3a 30 30 2e 30 30 30 30 30 30 5a 20 29 20 28  0:00.000000Z ) (
  00000250  20 29 20 29 20 28 20 37 3a 63 6c 61 70 61 63 6b   ) ) ( 7:clapack
  00000260  20 64 69 72 20 30 20 66 61 6c 73 65 20 30 20 28   dir 0 false 0 (
  00000270  20 32 37 3a 31 39 37 30 2d 30 31 2d 30 31 54 30   27:1970-01-01T0
  00000280  30 3a 30 30 3a 30 30 2e 30 30 30 30 30 30 5a 20  0:00:00.000000Z 
  00000290  29 20 28 20 29 20 29 20 29 20 29 20 29 20        ) ( ) ) ) ) ) 
Outgoing packet type 96 / 0x60 (SSH2_MSG_CHANNEL_EOF)
  00000000  00 00 00 00                                      ....

I changed the line of code in client.c to

apr_pool_note_subprocess(pool, proc, APR_KILL_NEVER);

and the problem is cured. Here is the tail of a packet trace showing the
proper TCP close:

42 13.656250  svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
43 13.656250  bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4197 ...
44 13.734375  svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
45 13.734375  bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4197 ...
46 13.812500  svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
47 13.828125  bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4197 ...
48 13.906250  svn.dc3.com bob.dc3.com TCP:Flags=...AP..., SrcPort=7822 ...
49 13.906250  bob.dc3.com svn.dc3.com TCP:Flags=...AP..., SrcPort=4197 ...
50 13.906250  bob.dc3.com svn.dc3.com TCP:Flags=...A...F, SrcPort=4197 ...
51 13.984375  svn.dc3.com bob.dc3.com TCP:Flags=...A...., SrcPort=7822 ...
52 13.984375  svn.dc3.com bob.dc3.com TCP:Flags=...A...., SrcPort=7822 ...
53 13.984375  svn.dc3.com bob.dc3.com TCP:Flags=...A...F, SrcPort=7822 ...
54 13.984375  bob.dc3.com svn.dc3.com TCP:Flags=...A...., SrcPort=4197 ...

You can see the tunnel and the remote sshd exchanging FIN packets. And here
is the tail of the tunnel packet log for the svn ls command:

  00000210  29 20 28 20 37 3a 69 6e 73 74 72 65 67 20 64 69  ) ( 7:instreg di
  00000220  72 20 30 20 66 61 6c 73 65 20 30 20 28 20 32 37  r 0 false 0 ( 27
  00000230  3a 31 39 37 30 2d 30 31 2d 30 31 54 30 30 3a 30  :1970-01-01T00:0
  00000240  30 3a 30 30 2e 30 30 30 30 30 30 5a 20 29 20 28  0:00.000000Z ) (
  00000250  20 29 20 29 20 28 20 37 3a 63 6c 61 70 61 63 6b   ) ) ( 7:clapack
  00000260  20 64 69 72 20 30 20 66 61 6c 73 65 20 30 20 28   dir 0 false 0 (
  00000270  20 32 37 3a 31 39 37 30 2d 30 31 2d 30 31 54 30   27:1970-01-01T0
  00000280  30 3a 30 30 3a 30 30 2e 30 30 30 30 30 30 5a 20  0:00:00.000000Z 
  00000290  29 20 28 20 29 20 29 20 29 20 29 20 29 20        ) ( ) ) ) ) ) 
Outgoing packet type 96 / 0x60 (SSH2_MSG_CHANNEL_EOF)
  00000000  00 00 00 00                                      ....
Event Log: Sent EOF message
Incoming packet type 96 / 0x60 (SSH2_MSG_CHANNEL_EOF)
  00000000  00 00 01 00                                      ....
Incoming packet type 98 / 0x62 (SSH2_MSG_CHANNEL_REQUEST)
  00000000  00 00 01 00 00 00 00 0b 65 78 69 74 2d 73 74 61  ........exit-sta
  00000010  74 75 73 00 00 00 00 00                          tus.....
Event Log: Server sent command exit status 0
Incoming packet type 97 / 0x61 (SSH2_MSG_CHANNEL_CLOSE)
  00000000  00 00 01 00                                      ....
Outgoing packet type 97 / 0x61 (SSH2_MSG_CHANNEL_CLOSE)
  00000000  00 00 00 00                                      ....
Event Log: Disconnected: All channels closed

You can see that the tunnel was allowed to complete its activity and exit
through its normal paths. The remote sshd/svnserve processes exit cleanly as
well. Problem solved!

RECOMMENDATIONS:

The cleanest and safest way to handle this would seem to be:

1. Modify apr to support the kill mode APR_KILL_AFTER_TIMEOUT *on Windows*.
This would cause the tunnel to be killed after three seconds, presumably
plenty of time. 

2. Modify libsvn_ra_svn\client.c to use APR_KILL_AFTER_TIMEOUT *on Windows*
instead of APR_KILL_ONLY_ONCE.

However, patching apr is probably not acceptable, so instead:

1. Modify libsvn_ra_svn\client.c to use APR_KILL_NEVER *on Windows* instead
of APR_KILL_ONLY_ONCE.
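For concreteness, the proposed client-side choice might look like the sketch
below. This is a hypothetical, untested illustration: the enum and function
names are mine, chosen so the snippet is self-contained without APR headers;
the real change would simply wrap the existing apr_pool_note_subprocess()
call in a platform conditional.

```c
/* Stand-in names for APR's kill modes; the real constants are declared in
 * apr_thread_proc.h.  Values here are illustrative only. */
typedef enum { KILL_NEVER, KILL_ONLY_ONCE, KILL_ALWAYS } kill_mode_t;

/* Sketch of the proposed platform-specific choice in libsvn_ra_svn/client.c:
 * on Windows, register the tunnel with KILL_NEVER so pool cleanup leaves it
 * free to finish its SSH shutdown handshake; elsewhere keep the existing
 * KILL_ONLY_ONCE behavior (SIGTERM, wait, SIGKILL). */
kill_mode_t tunnel_kill_mode(int on_windows)
{
    return on_windows ? KILL_NEVER : KILL_ONLY_ONCE;
}
```

In the real code this would reduce to wrapping the one
apr_pool_note_subprocess() call in an #ifdef on the Windows build.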

What do you think?

  -- Bob Denny

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2407949

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2410074

Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Branko Cibej <br...@xbc.nu>.
Bob Denny wrote:
> The problem is that I would be patching part of the  Apache Portable Runtime,
> something I have zero real-world experience with. There's no SIGTERM (as
> observed above) so my patch would be to just wait three seconds then kill. This
> would probably be met with ridicule by the webserver people who are used to
> their server spawning hundreds of subprocesses per millisecond :-) I believe
> that I would be frozen out of the Apache group for a patch especially since my
> patch would be Windows oriented and that would tag me as "one of those Windoze
> guys" :-) That's why I observed that it probably isn't practical.
>   

Interesting point of view. I expect you didn't know that the "boss" of
the APR project happens to be "one of those Windoze guys" at the moment. :)

Rest assured, the people working on APR won't respond with a broadside
to an imperfect Windows-specific patch. We're not that ilk of petty L33+
H4x0rs; in fact we mostly don't want to have anything to do with L33+
H4x0rs. :) Contributions are always welcome.

>>> 1. Modify libsvn_ra_svn\client.c to use APR_KILL_NEVER *on Windows*
>>> instead of APR_KILL_ONLY_ONCE.
>>>
>>> What do you think?
>>>       
>
>   
>> KILL_NEVER is not a good solution. If it was, it would already be done
>> this way, and we'd never have had to mess around with killing SSH
>> processes in the first place.
>>     
>
> I don't think this would present a problem on windows, and that is why I
> suggested ONLY doing this on Windows. I'm aware of the issues on Linux,
> though, and why you decided to do it to tunnel subprocesses.
>
> So it seems we're stuck: On one hand, we kill tunnels instantly, leaving
> sshd/svnserve processes running and accumulating on the remote end. This
> certainly happens. On the other hand, we don't kill tunnels and they don't exit
> cleanly and accumulate on the local system. It's less clear to me whether this
> is a rare or a frequent problem. I wonder how often this happens any more? Is
> this an ancient problem for which the solution has been in the code for a long
> time? Or is this a problem on Linux and not Windows? Maybe it doesn't exist on
> modern versions of Linux? Apart from patching apr, my feeling is that KILL_NEVER
> *ON WINDOWS ONLY* is the lesser of two evils (by a lot). KILL_ONCE (SIGTERM, 3
> sec., SIGKILL) is the best for Linux, etc., and that's what you did in 1.6.5.
> Good one.
>   


I think we need some hard data in order to make a reasonable decision
here. Bob, since you see and presumably can reproduce this
indiscriminate proliferation of daemons, could you possibly test your
assumption that KILL_NEVER does the right thing on Windows? It would
require modifying and recompiling Subversion.

-- Brane

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2408466

Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Greg Hudson <gh...@mit.edu>.
On Wed, 2009-10-21 at 21:04 -0400, Joseph Galbraith wrote:
> > And yet terminating the TCP connection does not have the same effect on
> > the server process?  That is to say, when the client is killed,
> > shouldn't the OS be responsible for immediately closing the TCP
> > session?

> The problem is that the TCP stack may be unable to detect
> a connection drop (on either the client or the server) unless
> the application tries to send some data.  (This is independent
> of the OS; I believe it is an artifact of the way IP works.)

No, that's not accurate.  On Unix-like systems, if you kill the client
process, however mercilessly, the kernel will close its TCP connections
and send FIN packets to the server (retransmitting if necessary).

Like Peter, I would expect Windows to do the same--but like Peter, I
don't really know for sure.

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2410065

Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Joseph Galbraith <ga...@vandyke.com>.
Peter Samuelson wrote:
> [Bob Denny]
>> Yes, it does need improvement :-) He is correct in his thoughts about
>> subprocesses on windows. Unless the subprocess is poorly written,
>> closing the input handle will make the subprocess ultimately go away.
> 
> And yet terminating the TCP connection does not have the same effect on
> the server process?  That is to say, when the client is killed,
> shouldn't the OS be responsible for immediately closing the TCP
> session?  Are you saying your Windows kernel (I don't remember if you
> mentioned which version of Windows you have) does _not_ do this?  That
> sounds to me like a whole other bug - in that Windows kernel.

The problem is that the TCP stack may be unable to detect
a connection drop (on either the client or the server) unless
the application tries to send some data.  (This is independent
of the OS; I believe it is an artifact of the way IP works.)

I almost never see this in a LAN environment (the stack tends
to detect connection drops instantaneously), but if a router
is involved, it isn't unusual.

Turning keepalives on for the socket can help this problem,
but generally, keepalives don't kick in until the socket
has been idle for an hour.  (So the stack won't detect the
drop for an hour even with keepalives on.)
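Joseph's keepalive remark maps to a single socket option. A minimal POSIX
sketch (illustrative; the helper name is mine, and the idle interval before
probing is system-dependent and tunable, e.g. via tcp(7) settings on Linux):

```c
#include <sys/socket.h>

/* Enable TCP keepalive probes on a socket so the stack eventually notices
 * a vanished peer even when no application data is in flight.  Returns 0
 * on success, -1 on error (mirroring setsockopt). */
int enable_keepalive(int fd)
{
    int on = 1;
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}
```

APR exposes the same switch portably as APR_SO_KEEPALIVE through
apr_socket_opt_set(), which is how Subversion-side code would set it, if at all.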

So if the ssh client doesn't get a chance to do a clean shutdown,
the server may not notice the connection drop for up to an hour.

Now, I don't swear that is what is going on in _this_ case... but
in the general case... this can be a problem.

Thanks,

Joseph

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2410027

RE: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Paul Charlton <te...@byiq.com>.
Peter,
It's not so much a bug as a *permitted* performance optimization for TCP
... the behavior keeps all in-between routers' tables "hot" unless the
server side chooses to force-close its side of the TCP connection.  Think of
a modern "secure" browser which has a relatively long duration session with
a server consisting of multiple individual TCP sessions whose lifetimes
don't necessarily overlap, some of which are originating from child
processes of the browser for security reasons.  The half-duplex RST allows
the server to determine when the overall set of TCP sessions is done,
without triggering a router table flush which a FIN packet would allow.

The OSI layers do specifically allow for out-of-band control flow, unrelated
to read/write buffering status.

Best regards,
Paul

> -----Original Message-----
> From: Peter Samuelson [mailto:peter@p12n.org]
> Sent: Thursday, October 22, 2009 6:49 AM
> To: dev@subversion.tigris.org
> Cc: Bob Denny
> Subject: Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]
> 
> [Peter Samuelson]
> > That is to say, when the client is killed, shouldn't the OS be
> > responsible for immediately closing the TCP session?  Are you saying
> > your Windows kernel (I don't remember if you mentioned which version
> > of Windows you have) does _not_ do this?
> 
> I can confirm that Windows Server 2008 has this bug.  Using 'telnet',
> the Windows Task Manager thingy, and tcpdump on the server end, it
> seems that when you kill the client app, the OS makes _no_ effort to
> close the TCP connection.  I tested with Linux (kernel 2.6.30, but I
> doubt it matters) and 'kill -9', and the kernel sends a FIN packet, and
> ACKs the FIN it gets back from the server, exactly as you'd expect.
> 
> We still have no idea what Windows kernel Bob Denny is running, but
> presumably it has the same bug as Windows Server 2008.
> --
> Peter Samuelson | org-tld!p12n!peter | http://p12n.org/
> 
> ------------------------------------------------------
> http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2410225

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2410251

Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Peter Samuelson <pe...@p12n.org>.
[Peter Samuelson]
> That is to say, when the client is killed, shouldn't the OS be
> responsible for immediately closing the TCP session?  Are you saying
> your Windows kernel (I don't remember if you mentioned which version
> of Windows you have) does _not_ do this?

I can confirm that Windows Server 2008 has this bug.  Using 'telnet',
the Windows Task Manager thingy, and tcpdump on the server end, it
seems that when you kill the client app, the OS makes _no_ effort to
close the TCP connection.  I tested with Linux (kernel 2.6.30, but I
doubt it matters) and 'kill -9', and the kernel sends a FIN packet, and
ACKs the FIN it gets back from the server, exactly as you'd expect.

We still have no idea what Windows kernel Bob Denny is running, but
presumably it has the same bug as Windows Server 2008.
-- 
Peter Samuelson | org-tld!p12n!peter | http://p12n.org/

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2410225

Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Peter Samuelson <pe...@p12n.org>.
[Bob Denny]
> Yes, it does need improvement :-) He is correct in his thoughts about
> subprocesses on windows. Unless the subprocess is poorly written,
> closing the input handle will make the subprocess ultimately go away.

And yet terminating the TCP connection does not have the same effect on
the server process?  That is to say, when the client is killed,
shouldn't the OS be responsible for immediately closing the TCP
session?  Are you saying your Windows kernel (I don't remember if you
mentioned which version of Windows you have) does _not_ do this?  That
sounds to me like a whole other bug - in that Windows kernel.
-- 
Peter Samuelson | org-tld!p12n!peter | http://p12n.org/

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2409813

Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Branko Cibej <br...@xbc.nu>.
Bob Denny wrote:
> Hi Branko --
>
> Thanks for the reply! I have been pushed away from a couple of open source
> projects in the past, so I do appreciate you and Stefan listening :-)
>
> Branko:
>   
>> I think we need some hard data in order to make a reasonable decision
>> here. Bob, since you see and presumably can reproduce this
>> indiscriminate proliferation of daemons; could you possibly test your
>> assumption that KILL_NEVER does the right thing on Windows? It would
>> require modifying and recompiling Subversion.
>>     
>
> I've already done this with svn 1.6.5 and Linux based svn servers, and I'm
> running it here in production. Actually I checked out the TortoiseSVN tree which
> contains svn 1.6.5 and built THAT, but it results in both Tortoise as well as
> svn 1.6.5 *client side* and libraries. Both are working perfectly now.

OK; I think that's as much confirmation as we need. I believe this code
hasn't changed since 1.6.5.

[...]
> If I need to make Subversion so I can test against a WINDOWS ssh/svnserve I
> will, but it will be some time as I am nearing the most important astronomy
> conference of the year for me.

No-one expects you to work miracles. :) Like I said, contributions are
always welcome; and yours are accompanied by more thorough research than
the average. Thanks!

>  It will be several weeks before I can try to make
> a Windows svnserve. But someone out there must already have a Windows build environment and could quickly make the patch and test.
>   

Sure. (I just happen to not be one of those someones these days.)


So here's the situation as I see it:

    * Our current code works with Unix clients against Unix servers.
    * Bob's patch works with Windows clients against Unix servers, tested
      with the 2 (3?) relevant SSH clients for Windows.
    * I very much suspect that this issue is not affected by the server
      side, but think it would be nice if someone could test that.

Given all the above, and seeing as this doesn't appear to be something
easily remedied by an APR fix ... and assuming that the
Windows<->Windows test works out ... I'd propose we make the
Windows-specific change that Bob suggested.

-- Brane

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2408506

RE: Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Bob Denny <rd...@dc3.com>.
Well, freeSSHd is not up to the task. It acts crazy with the "old" shell and the "new" shell interface. Plus it does not fork new processes, so it isn't going to show the problem. Its connections are managed by threads within a single process.

I think we're going to have to go with the common case of Windows client, Linux/OSX server.

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2409255

Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Bob Denny <rd...@dc3.com>.
Branko --

I did some research and I don't think a windows-to-windows test is going to mean
much. The problem manifests itself over svn+ssh, and I need an sshd for windows.

http://www.openssh.com/windows.html

They're all clients, not server daemons. The only one I could find that looks
remotely usable is freeSSHd

http://www.freesshd.com/?ctt=overview

and it's not a forking daemon. It's not listed on the OpenSSH
page either (of course). OpenSSH for Windows, also not listed, says things like
"coming soon, new developer..."; the last release was in 2004 and the maintainer abandoned it.

I think virtually everyone on Windows who is using svn+ssh is working with a
Linux back end like me. Without the patch, it is unusable on a fast system with
a modest net connection, as I explained previously.

KILL_ONCE makes sense on Linux where it actually works. KILL_NEVER is the best
of a bad set of choices on Windows. Thanks for the encouragement. I really did
put a lot of time into this, much of it climbing multiple learning curves :-))

  -- Bob

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2408617

Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Bob Denny <rd...@dc3.com>.
Hi Branko --

Thanks for the reply! I have been pushed away from a couple of open source
projects in the past, so I do appreciate you and Stefan listening :-)

Branko:
> I think we need some hard data in order to make a reasonable decision
> here. Bob, since you see and presumably can reproduce this
> indiscriminate proliferation of daemons; could you possibly test your
> assumption that KILL_NEVER does the right thing on Windows? It would
> require modifying and recompiling Subversion.

I've already done this with svn 1.6.5 and Linux-based svn servers, and I'm
running it here in production. Actually I checked out the TortoiseSVN tree which
contains svn 1.6.5 and built THAT, but that build produces both Tortoise and the
svn 1.6.5 *client side* binaries and libraries. Both are working perfectly now.
Before the patch, every single set of sshd/svnserve daemons that Tortoise
created was left running on the remote Linux machine. After a couple of moves
around in the Repo Browser, and a log or two, there'd be 10 or more of those
things that had to be killed. Now, no problem.

I did check out the Subversion trunk from Tigris and start on building it. But
by then I had already invested over three days in this, and for Subversion
itself I needed even more tools and bits from elsewhere (though I found the
tarball of externals you guys provide), and I had real work piling up. I do
astronomical observatory control software, I'm just one guy trying to eke out a
living at this (for 10 years), and being near the new moon, my paying work is
calling me :-) That's why I am up at 0330! Anyway, I ran into several roadblocks
and gave up on making Subversion from the trunk for lack of time and exhaustion
at wrestling with maketools.

I am NOT complaining! I think you guys are amazing. I just had to get back to work.

Anyway... I analyzed the problem, found its physical cause, located the source
of the cause, made a change that _should_ have worked, tested that, found it
works, and therefore I had a high degree of confidence in what I originally
wrote, before I wrote it. And I started from ground zero, with a complete lack
of knowledge of the structure and sources. That's why it took me several days.

If I need to make Subversion so I can test against a WINDOWS ssh/svnserve I
will, but it will be some time as I am nearing the most important astronomy
conference of the year for me. It will be several weeks before I can try to make
a Windows svnserve. But someone out there must already have a Windows build
environment and could quickly make the patch and test.

Anyway, thanks again...

  -- Bob

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2408497

Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

Posted by Bob Denny <rd...@dc3.com>.
> Bob Denny:
> As of SVN 1.6.5, the svn client forcibly kills
> the tunnel subprocess as soon as it receives the last of the data that
> it expects.
>
> Stefan Sperling:
> This is not entirely correct. We have been killing the ssh client
> for a long time.

I'm sorry, what I meant is that I observed it in 1.6.5. I have never used older
versions.

> What's new in 1.6.5 is that we now send SIGTERM instead of SIGKILL,
> which allows an SSH process to clean up a master socket it might be
> managing for SSH connection pooling.

This is true only for Linux. On Windows, inside APR, APR_KILL_ONLY_ONCE is
turned into APR_KILL_ALWAYS, and that becomes an instant TerminateProcess()
call, which is like SIGKILL but even harsher: it is not a signal at all; the OS
instantly stops process execution.
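That translation can be mirrored with a small self-contained mock (stand-in
enum names, not APR's actual source, which lives under threadproc/win32/):

```c
/* Stand-ins for APR's apr_kill_conditions_e (names and values are
 * illustrative only; see apr_thread_proc.h for the real ones). */
typedef enum { KILL_NEVER, KILL_ONLY_ONCE, KILL_AFTER_TIMEOUT,
               KILL_ALWAYS } kill_how_t;

/* Mock of the effective behavior on Windows as described above: every mode
 * except KILL_NEVER collapses to an unconditional hard kill at pool
 * cleanup, i.e. an immediate TerminateProcess(). */
kill_how_t effective_win32_mode(kill_how_t requested)
{
    return (requested == KILL_NEVER) ? KILL_NEVER : KILL_ALWAYS;
}
```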

> But it's *not* a new problem in 1.6.5.

Yes, I know.

> But [your solution] creates another problem which killing the subprocess is
> meant to cure. If long-lived Subversion clients (e.g. IDE integrations)
> using the SVN libraries open a connection via svn+ssh:// it's possible
> that the ssh subprocesses created never die. Over time, ssh processes
> accumulate on the client workstation, doing nothing but waiting to be
> killed.

Well, that's the flip side of this problem, where usage of the command line svn
does the same thing - "the ssh subprocesses created never die...".

> See the comment above the line you're quoting:
>
>   /* Arrange for the tunnel agent to get a SIGTERM on pool
>    * cleanup.  This is a little extreme, but the alternatives
>    * weren't working out.

Heh, and then there is the comment in apr function apr_proc_kill()
(threadproc/win32/signals.c):

  /* Windows only really support killing process, but that will do for now.
   *
   * ### Actually, closing the input handle to the proc should also do fine
   * for most console apps.  This definately needs improvement...
   */

Yes, it does need improvement :-)  He is correct in his thoughts about
subprocesses on Windows. Unless the subprocess is poorly written, closing the
input handle will make the subprocess ultimately go away. But we can't protect
the world against every conceivable programming error! At least the tunnels I
have used (ssh, PuTTY PLink, and TortoisePLink) all do it correctly: an EOF
on the stdin handle results (eventually) in cleanup_exit() being called, and the
protocol between it and the remote sshd runs to proper completion. And anyway,
if a tunnel does NOT go away when its stdin is closed, that's a bug in the tunnel!

> Patching APR is certainly acceptable.
> Would you be willing to work on a patch of APR and submit it there?

The problem is that I would be patching part of the Apache Portable Runtime,
something I have zero real-world experience with. There's no SIGTERM (as
observed above) so my patch would be to just wait three seconds then kill. This
would probably be met with ridicule by the webserver people who are used to
their server spawning hundreds of subprocesses per millisecond :-) I believe
that I would be frozen out of the Apache group for a patch especially since my
patch would be Windows oriented and that would tag me as "one of those Windoze
guys" :-) That's why I observed that it probably isn't practical.

>> 1. Modify libsvn_ra_svn\client.c to use APR_KILL_NEVER *on Windows*
>> instead of APR_KILL_ONLY_ONCE.
>>
>> What do you think?

> KILL_NEVER is not a good solution. If it was, it would already be done
> this way, and we'd never have had to mess around with killing SSH
> processes in the first place.

I don't think this would present a problem on Windows, and that is why I
suggested ONLY doing this on Windows. I'm aware of the issues on Linux,
though, and why you decided to do it to tunnel subprocesses.

So it seems we're stuck: On one hand, we kill tunnels instantly, leaving
sshd/svnserve processes running and accumulating on the remote end. This
certainly happens. On the other hand, we don't kill tunnels and they don't exit
cleanly and accumulate on the local system. It's less clear to me whether this
is a rare or a frequent problem. I wonder how often this happens any more? Is
this an ancient problem for which the solution has been in the code for a long
time? Or is this a problem on Linux and not Windows? Maybe it doesn't exist on
modern versions of Linux? Apart from patching apr, my feeling is that KILL_NEVER
*ON WINDOWS ONLY* is the lesser of two evils (by a lot). KILL_ONCE (SIGTERM, 3
sec., SIGKILL) is the best for Linux, etc., and that's what you did in 1.6.5.
Good one.

  -- Bob

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2408379