Posted to dev@thrift.apache.org by Rush Manbert <ru...@manbert.com> on 2009/06/26 01:03:34 UTC
C++ Lib: TSocket and SO_LINGER strangeness
In order to test all of the Asio based code I have written, I have
modified StressTest in various ways. One is to allow an --endless
command line argument that causes the part of the test that runs the
clients to be looped forever. This lets me start the server, then
hammer on it overnight.
Once I had this capability, I discovered that if I ran with the
parameters --clients=20 --loop=10 I could cause the test to fail
within a short amount of time. The failure mode is that a client task
catches a Transport exception when it calls one of the loopEchoXxx()
methods. The exception says that ECONNRESET was seen on a read.
I tested with various client and server configurations:
Client  Server  Notes         Test Result
------  ------  ------------  -----------
Mac     Win XP                No failures
WinXP   Mac                   No failures
WinXP   WinXP   Loopback      No failures
Mac     Mac     Loopback      No failures
Mac     Mac     Non-loopback  No failures
The code running on Windows XP is obviously Asio based, but these test
results occur regardless of whether the code running on the Mac is
Asio based (my new stuff) or the original code.
I did a lot of searching on the internet, and one page (http://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable)
made me start wondering about the linger settings for the sockets.
The TSocket constructor used by clients initializes the linger
variables like this:
lingerOn_(1),
lingerVal_(0),
The server socket always has linger turned off.
I used tcpdump to capture the packets transmitted when a failure
occurs. For this test I used --clients=2 --loop=1. Here is the trace
for the last 2 clients. I have broken each line of the dump
immediately after the "Src>Dst:" portion, so the flags show up in
column 2 of the next line:
13:59:11.372267 IP localhost.53127 > localhost.xmltec-xmlmail:
S 2349780390:2349780390(0) win 65535 <mss 16344,nop,wscale
1,nop,nop,timestamp 824548119 0,sackOK,eol>
E..@a.@.@.............#..............4....?........
1%..........
13:59:11.372270 IP localhost.53128 > localhost.xmltec-xmlmail:
S 2191687293:2191687293(0) win 65535 <mss 16344,nop,wscale
1,nop,nop,timestamp 824548119 0,sackOK,eol>
E..@.'@.@.............#...~}.........4....?........
1%..........
13:59:11.372304 IP localhost.xmltec-xmlmail > localhost.53127:
S 2047182235:2047182235(0) ack 2349780391 win 65535 <mss
16344,nop,wscale 2,nop,nop,timestamp 824548119 824548119,sackOK,eol>
E..@.}@.@...........#...z............4....?........
1%..1%......
13:59:11.372336 IP localhost.xmltec-xmlmail > localhost.53128:
S 994342982:994342982(0) ack 2191687294 win 65535 <mss
16344,nop,wscale 2,nop,nop,timestamp 824548119 824548119,sackOK,eol>
E..@M.@.@...........#...;DxF..~~.....4....?........
1%..1%......
13:59:11.372345 IP localhost.53127 > localhost.xmltec-xmlmail:
. ack 1 win 40830 <nop,nop,timestamp 824548119 824548119>
E..4..@.@.............#.....z......~.(.....
1%..1%..
13:59:11.372353 IP localhost.53128 > localhost.xmltec-xmlmail:
. ack 1 win 40830 <nop,nop,timestamp 824548119 824548119>
E..4!.@.@.............#...~~;DxG...~.(.....
1%..1%..
13:59:11.372372 IP localhost.xmltec-xmlmail > localhost.53127:
R 2047182236:2047182236(0) win 0
<<<<<<<<---------------------------------LOOK!!!!
E..(<-@.@...........#...z.......P.......
13:59:11.372377 IP localhost.53127 > localhost.xmltec-xmlmail:
P 1:32(31) ack 1 win 40830 <nop,nop,timestamp 824548119 824548119>
E..S.)@.@.............#.....z......~.G.....
1%..1%..........echoI64....
...........
13:59:11.372385 IP localhost.53128 > localhost.xmltec-xmlmail:
P 1:32(31) ack 1 win 40830 <nop,nop,timestamp 824548119 824548119>
E..S.I@.@.............#...~~;DxG...~.G.....
1%..1%..........echoI64....
...........
13:59:11.373086 IP localhost.53128 > localhost.xmltec-xmlmail:
R 32:32(0) ack 1 win 40830
E..(i?@.@.............#...~.;DxGP..~....
The line marked with "LOOK!!!!" is where the failure occurs. The
client and server have made their socket connections, but the server
sends a RST packet before the client sends his first data packet. I'm
using the TSimpleServer, and it almost looks as if the server tried to
read before the client wrote any data, and when there was nothing
there the server immediately closed the socket. But that would only
explain this trace if the server had written data to the socket that
had not yet been read by the client and if the server had linger on
with a timeout of zero.
However, I did these experiments in a different order, and because of
that I tried modifying the TSocket constructor so that the linger
variables were initialized like this:
lingerOn_(0),
lingerVal_(0),
which turns linger off for the client socket. After I made this
change, the loopback test on the Mac just runs and runs and runs with
no errors. I also tried having linger on, but with a 5 second timeout
and that also runs without error.
I'm really at a loss to explain this. The tcpdump output makes it
pretty clear that the server immediately sent RST. But if I just print
the port_ value in TSocket::close() and run until a failure occurs,
there is no evidence that the server side closed its socket. And if I
turn off linger on the client side everything works like a charm and I
see no failures. How can the client side settings affect the server
side behavior?
I'm hoping that one of you POSIX sockets guys will have some idea of
what's happening here. It looks like we can sidestep the problem by
setting linger off, or setting it on with some nonzero time if we're
running a client on a Mac, but I would rather understand what's going
on and fix it if possible, rather than hack around it.
- Rush
Re: C++ Lib: TSocket and SO_LINGER strangeness
Posted by Rush Manbert <ru...@manbert.com>.
Geez, cut and paste errors in the original post. Fixed here, I
sincerely hope.
On Jun 25, 2009, at 4:03 PM, Rush Manbert wrote:
<snip>
> I tested with various client and server configurations:
> Client Server Notes Test Result
> ------ ------ ----------------- ------------
> Mac Win XP No failures
> WinXP Mac No failures
> WinXp WinXP Loopback No failures
<snip>
> Mac Mac Loopback Fails
> Mac Mac Non-loopback Fails
i.e. only the Mac to Mac tests fail.
<snip>
Re: C++ Lib: TSocket and SO_LINGER strangeness
Posted by Eric Anderson <er...@hp.com>.
Rush Manbert writes:
> > We found this on Linux 2.6 kernel. I'd recommend capturing a trace
> > and examining it in wireshark. We tried lots of tcp options, but we
> > couldn't find one that had a useful effect.
> > -Eric
>
> Thanks for the info!
Rush,
For what it's worth, our solution was to set an early timer on
our connect call, and if that timer fires then we close the socket and
try again. We happened to know if the connection was local, and set
the timer differently (0.05 seconds vs 1 second) in that case. It was
an ugly hack.
-Eric
Re: C++ Lib: TSocket and SO_LINGER strangeness
Posted by Rush Manbert <ru...@manbert.com>.
Hi Eric,
On Jun 25, 2009, at 5:42 PM, Eric Anderson wrote:
> Rush Manbert writes:
>> [ seeing TCP resets ]
>
Nice summary. :-)
> You may be seeing the same thing we do under some stress tests with a
> different app. The problem we tracked down was Linux re-using port
> numbers before the timeout had expired. We saw something a little
> different, namely syn, syn-ack, rst. We debugged it using wireshark,
> which just reported that the ports were re-used. I also wrote a
> script that looked at port-reuse time and found that it happened a lot
> under the 60 second (might be 30) timeout, but it only rarely mattered
> to us because we had many different servers.
syn, syn-ack, rst is pretty much what happens in the client/server
exchange that fails. I'll get wireshark and see what I can find out.
One good thing about this whole exercise is that I'm learning how to
use a bunch of new-to-me tools.
>
> We found this on Linux 2.6 kernel. I'd recommend capturing a trace
> and examining in wireshark. We tried lots of tcp options, but it
> couldn't find one that had a useful effect.
> -Eric
Thanks for the info!
- Rush
Re: C++ Lib: TSocket and SO_LINGER strangeness
Posted by Eric Anderson <er...@hp.com>.
Rush Manbert writes:
> [ seeing TCP resets ]
You may be seeing the same thing we do under some stress tests with a
different app. The problem we tracked down was Linux re-using port
numbers before the timeout had expired. We saw something a little
different, namely syn, syn-ack, rst. We debugged it using wireshark,
which just reported that the ports were re-used. I also wrote a
script that looked at port-reuse time and found that it happened a lot
under the 60 second (might be 30) timeout, but it only rarely mattered
to us because we had many different servers.
We found this on Linux 2.6 kernel. I'd recommend capturing a trace
and examining it in wireshark. We tried lots of tcp options, but we
couldn't find one that had a useful effect.
-Eric