You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Michael Ho (Code Review)" <ge...@cloudera.org> on 2019/06/21 03:22:42 UTC

[kudu-CR] KUDU-2192: Enable TCP keepalive for all outbound connections

Michael Ho has uploaded this change for review. ( http://gerrit.cloudera.org:8080/13702


Change subject: KUDU-2192: Enable TCP keepalive for all outbound connections
......................................................................

KUDU-2192: Enable TCP keepalive for all outbound connections

This change enables TCP keepalive for all outbound connections.
This aims to handle cases in which the remote peer may have
dropped off the network without sending a TCP RST. For instance,
a remote host could have hit a kernel panic and got power cycled.
In which case, the existing TCP connection to that host may be
stale. In an idle cluster, this stale connection may not be detected
until the next use of it, in which case it will result in a RPC
failure due to TCP RST sent from the restarted peer.

By enabling TCP keepalive, we ensure that stale TCP connections
in an idle cluster will be detected and closed within a time bound
so a new connection will be created on the next use. This change
introduces 3 different flags:

--tcp_keepalive_probe_period_s: the duration in seconds a TCP connection
has to be idle before keepalive probes started to be sent.

--tcp_keepalive_retry_period_s: the duration in seconds between successive
keepalive probes if previous probes didn't get an ACK from remote peer.

--tcp_keepalive_retry_count: the maximum number of TCP keepalive probes
sent without an ACK before declaring the remote peer as dead.

Testing:
- Used TCP dump to verify that keepalive probes are being sent periodically.
- Verified that blocking all incoming traffic to a server's port via an iptable
rule caused the TCP connection to be closed and the keepalive probes to stop
eventually.

Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
---
M src/kudu/rpc/reactor.cc
M src/kudu/rpc/reactor.h
M src/kudu/rpc/rpc-test.cc
M src/kudu/util/net/socket.cc
M src/kudu/util/net/socket.h
5 files changed, 90 insertions(+), 1 deletion(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/13702/1
-- 
To view, visit http://gerrit.cloudera.org:8080/13702
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
Gerrit-Change-Number: 13702
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho <kw...@cloudera.com>

[kudu-CR] KUDU-2192: Enable TCP keepalive for all outbound connections

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/13702 )

Change subject: KUDU-2192: Enable TCP keepalive for all outbound connections
......................................................................


Patch Set 4:

Can some committers of Kudu please merge this patch for me ? I cannot merge it myself. Thanks.


-- 
To view, visit http://gerrit.cloudera.org:8080/13702
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
Gerrit-Change-Number: 13702
Gerrit-PatchSet: 4
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 27 Jun 2019 22:34:03 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2192: Enable TCP keepalive for all outbound connections

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13702

to look at the new patch set (#2).

Change subject: KUDU-2192: Enable TCP keepalive for all outbound connections
......................................................................

KUDU-2192: Enable TCP keepalive for all outbound connections

This change enables TCP keepalive for all outbound connections.
This aims to handle cases in which the remote peer may have
dropped off the network without sending a TCP RST. For instance,
a remote host could have hit a kernel panic and got power cycled.
In which case, the existing TCP connection to that host may be
stale. In an idle cluster, this stale connection may not be detected
until the next use of it, in which case it will result in a RPC
failure due to TCP RST sent from the restarted peer.

By enabling TCP keepalive, we ensure that stale TCP connections
in an idle cluster will be detected and closed within a time bound
so a new connection will be created on the next use. This change
introduces 3 different flags:

--tcp_keepalive_probe_period_s: the duration in seconds a TCP connection
has to be idle before keepalive probes started to be sent.

--tcp_keepalive_retry_period_s: the duration in seconds between successive
keepalive probes if previous probes didn't get an ACK from remote peer.

--tcp_keepalive_retry_count: the maximum number of TCP keepalive probes
sent without an ACK before declaring the remote peer as dead.

Testing:
- Used TCP dump to verify that keepalive probes are being sent periodically.
- Verified that blocking all incoming traffic to a server's port via an iptable
rule caused the TCP connection to be closed and the keepalive probes to stop
eventually.

Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
---
M src/kudu/rpc/connection.cc
M src/kudu/rpc/connection.h
M src/kudu/rpc/reactor.cc
M src/kudu/rpc/reactor.h
M src/kudu/rpc/rpc-test.cc
M src/kudu/util/net/socket.cc
M src/kudu/util/net/socket.h
7 files changed, 111 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/13702/2
-- 
To view, visit http://gerrit.cloudera.org:8080/13702
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
Gerrit-Change-Number: 13702
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] KUDU-2192: Enable TCP keepalive for all outbound connections

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/13702 )

Change subject: KUDU-2192: Enable TCP keepalive for all outbound connections
......................................................................


Patch Set 4: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/13702
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
Gerrit-Change-Number: 13702
Gerrit-PatchSet: 4
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 27 Jun 2019 05:50:26 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2192: Enable TCP keepalive for all outbound connections

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/13702 )

Change subject: KUDU-2192: Enable TCP keepalive for all outbound connections
......................................................................


Patch Set 4:

(5 comments)

I tested with this patch with TLS and Kerberos enabled: 

client connection to 172.31.115.178:27000 recv error: Network error: failed to read from TLS socket (remote: 0.0.0.0:0): Connection reset by peer (error 104)

The connection was closed to the peer was closed shortly after the iptables rule was installed. Confirmed that a new connection was re-created afterwards.

http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/rpc/reactor.cc
File src/kudu/rpc/reactor.cc:

http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/rpc/reactor.cc@631
PS3, Line 631:     Status keepalive_status = conn->SetTcpKeepAlive(
> nit: consider using scope-only variable here, something like
Done


http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/rpc/rpc-test.cc
File src/kudu/rpc/rpc-test.cc:

http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/rpc/rpc-test.cc@790
PS3, Line 790:   // Set up server.
> As Todd already mentioned, it would be great to have a scenario when connec
There is no easy way to simulate this situation programmatically without using iptables or tc qdisc netem command.

I manually tested with iptables rules with and without TLS by applying this patch in Impala code base.


http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/rpc/rpc-test.cc@798
PS3, Line 798:   FLAGS_tcp_keepalive_retry_count = 1;
> no need for this -- our toplevel KuduTest base class handles this for all t
Done


http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/util/net/socket.h
File src/kudu/util/net/socket.h:

http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/util/net/socket.h@160
PS3, Line 160: 
> nit: it's a bit strange to see MonoDelta here, supposedly lower-level than 
Done


http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/util/net/socket.cc
File src/kudu/util/net/socket.cc:

http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/util/net/socket.cc@591
PS3, Line 591: DCHECK_GT(id
> nit: static const char* const ?
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/13702
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
Gerrit-Change-Number: 13702
Gerrit-PatchSet: 4
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 25 Jun 2019 21:27:03 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2192: Enable TCP keepalive for all outbound connections

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/13702 )

Change subject: KUDU-2192: Enable TCP keepalive for all outbound connections
......................................................................


Patch Set 4: Code-Review+1


-- 
To view, visit http://gerrit.cloudera.org:8080/13702
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
Gerrit-Change-Number: 13702
Gerrit-PatchSet: 4
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 26 Jun 2019 04:56:05 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2192: Enable TCP keepalive for all outbound connections

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13702

to look at the new patch set (#3).

Change subject: KUDU-2192: Enable TCP keepalive for all outbound connections
......................................................................

KUDU-2192: Enable TCP keepalive for all outbound connections

This change enables TCP keepalive for all outbound connections.
This aims to handle cases in which the remote peer may have
dropped off the network without sending a TCP RST. For instance,
a remote host could have hit a kernel panic and got power cycled.
In which case, the existing TCP connection to that host may be
stale. In an idle cluster, this stale connection may not be detected
until the next use of it, in which case it will result in a RPC
failure due to TCP RST sent from the restarted peer.

By enabling TCP keepalive, we ensure that stale TCP connections
in an idle cluster will be detected and closed within a time bound
so a new connection will be created on the next use. This change
introduces 3 different flags:

--tcp_keepalive_probe_period_s: the duration in seconds a TCP connection
has to be idle before keepalive probes started to be sent.

--tcp_keepalive_retry_period_s: the duration in seconds between successive
keepalive probes if previous probes didn't get an ACK from remote peer.

--tcp_keepalive_retry_count: the maximum number of TCP keepalive probes
sent without an ACK before declaring the remote peer as dead.

Testing:
- Used TCP dump to verify that keepalive probes are being sent periodically.
- Verified that blocking all incoming traffic to a server's port via an iptable
rule caused the TCP connection to be closed and the keepalive probes to stop
eventually.

Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
---
M src/kudu/rpc/connection.cc
M src/kudu/rpc/connection.h
M src/kudu/rpc/reactor.cc
M src/kudu/rpc/reactor.h
M src/kudu/rpc/rpc-test.cc
M src/kudu/util/net/socket.cc
M src/kudu/util/net/socket.h
7 files changed, 112 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/13702/3
-- 
To view, visit http://gerrit.cloudera.org:8080/13702
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
Gerrit-Change-Number: 13702
Gerrit-PatchSet: 3
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] KUDU-2192: Enable TCP keepalive for all outbound connections

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/13702 )

Change subject: KUDU-2192: Enable TCP keepalive for all outbound connections
......................................................................


Patch Set 3:

The server shutdown test case in rpc-test is a good one but just wondering if there are other similar types of test in the test suite which I can extend to test this patch ?


-- 
To view, visit http://gerrit.cloudera.org:8080/13702
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
Gerrit-Change-Number: 13702
Gerrit-PatchSet: 3
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 21 Jun 2019 17:31:18 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2192: Enable TCP keepalive for all outbound connections

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/13702 )

Change subject: KUDU-2192: Enable TCP keepalive for all outbound connections
......................................................................


Patch Set 3:

(1 comment)

Were you able to test this in any way using iptables to drop packets? I'm particularly wondering whether our various network code paths handle ENOTCONN in a correct way (I'd assume they would just treat it as any other read error and disconnect, but would be good to try this once)

http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/rpc/rpc-test.cc
File src/kudu/rpc/rpc-test.cc:

http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/rpc/rpc-test.cc@798
PS3, Line 798:     google::FlagSaver saver;
no need for this -- our toplevel KuduTest base class handles this for all test cases automatically



-- 
To view, visit http://gerrit.cloudera.org:8080/13702
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
Gerrit-Change-Number: 13702
Gerrit-PatchSet: 3
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 21 Jun 2019 16:34:36 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2192: Enable TCP keepalive for all outbound connections

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/13702 )

Change subject: KUDU-2192: Enable TCP keepalive for all outbound connections
......................................................................


Patch Set 3:

I mostly tested this change using iptables in the context of Impala's use of KRPC and I was able to get error message like:

connection.cc:518] client connection to 127.0.0.1:27000 recv error: Network error: recv error from 0.0.0.0:0: Transport endpoint is not connected (error 107)

I believe there are probably more paths in Kudu which I haven't tried out yet but I am not too familiar with Kudu so any advice on which area to look.


-- 
To view, visit http://gerrit.cloudera.org:8080/13702
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
Gerrit-Change-Number: 13702
Gerrit-PatchSet: 3
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 21 Jun 2019 17:29:50 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2192: Enable TCP keepalive for all outbound connections

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/13702 )

Change subject: KUDU-2192: Enable TCP keepalive for all outbound connections
......................................................................


Patch Set 3:

> Patch Set 3:
> 
> The server shutdown test case in rpc-test is a good one but just wondering if there are other similar types of test in the test suite which I can extend to test this patch ?

I think it's a bit difficult to test this one without iptables, right? If you kill -STOP a server would it stop responding to TCP keepalives? I would guess that the keepalive is handled at the kernel level and even if the process is kill -STOPed, the keepalives will keep getting handled.

At one point I prototyped iptables integration into our minicluster but didn't quite finish that, and wouldn't block this patch. I think some manual testing is probably sufficient. Can you run your same Impala iptables test but with RPC TLS enabled? That path is different enough that I could see a problem happening there.


-- 
To view, visit http://gerrit.cloudera.org:8080/13702
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
Gerrit-Change-Number: 13702
Gerrit-PatchSet: 3
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 21 Jun 2019 17:37:23 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2192: Enable TCP keepalive for all outbound connections

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Alexey Serbin, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13702

to look at the new patch set (#4).

Change subject: KUDU-2192: Enable TCP keepalive for all outbound connections
......................................................................

KUDU-2192: Enable TCP keepalive for all outbound connections

This change enables TCP keepalive for all outbound connections.
This aims to handle cases in which the remote peer may have
dropped off the network without sending a TCP RST. For instance,
a remote host could have hit a kernel panic and got power cycled.
In which case, the existing TCP connection to that host may be
stale. In an idle cluster, this stale connection may not be detected
until the next use of it, in which case it will result in a RPC
failure due to TCP RST sent from the restarted peer.

By enabling TCP keepalive, we ensure that stale TCP connections
in an idle cluster will be detected and closed within a time bound
so a new connection will be created on the next use. This change
introduces 3 different flags:

--tcp_keepalive_probe_period_s: the duration in seconds a TCP connection
has to be idle before keepalive probes started to be sent.

--tcp_keepalive_retry_period_s: the duration in seconds between successive
keepalive probes if previous probes didn't get an ACK from remote peer.

--tcp_keepalive_retry_count: the maximum number of TCP keepalive probes
sent without an ACK before declaring the remote peer as dead.

Testing:
- Used TCP dump to verify that keepalive probes are being sent periodically.
- Verified that blocking all incoming traffic to a server's port via an iptable
rule caused the TCP connection to be closed and the keepalive probes to stop
eventually.

Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
---
M src/kudu/rpc/connection.cc
M src/kudu/rpc/connection.h
M src/kudu/rpc/reactor.cc
M src/kudu/rpc/reactor.h
M src/kudu/rpc/rpc-test.cc
M src/kudu/util/net/socket.cc
M src/kudu/util/net/socket.h
7 files changed, 112 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/13702/4
-- 
To view, visit http://gerrit.cloudera.org:8080/13702
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
Gerrit-Change-Number: 13702
Gerrit-PatchSet: 4
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-2192: Enable TCP keepalive for all outbound connections

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/13702 )

Change subject: KUDU-2192: Enable TCP keepalive for all outbound connections
......................................................................

KUDU-2192: Enable TCP keepalive for all outbound connections

This change enables TCP keepalive for all outbound connections.
This aims to handle cases in which the remote peer may have
dropped off the network without sending a TCP RST. For instance,
a remote host could have hit a kernel panic and got power cycled.
In which case, the existing TCP connection to that host may be
stale. In an idle cluster, this stale connection may not be detected
until the next use of it, in which case it will result in a RPC
failure due to TCP RST sent from the restarted peer.

By enabling TCP keepalive, we ensure that stale TCP connections
in an idle cluster will be detected and closed within a time bound
so a new connection will be created on the next use. This change
introduces 3 different flags:

--tcp_keepalive_probe_period_s: the duration in seconds a TCP connection
has to be idle before keepalive probes started to be sent.

--tcp_keepalive_retry_period_s: the duration in seconds between successive
keepalive probes if previous probes didn't get an ACK from remote peer.

--tcp_keepalive_retry_count: the maximum number of TCP keepalive probes
sent without an ACK before declaring the remote peer as dead.

Testing:
- Used TCP dump to verify that keepalive probes are being sent periodically.
- Verified that blocking all incoming traffic to a server's port via an iptable
rule caused the TCP connection to be closed and the keepalive probes to stop
eventually.

Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
Reviewed-on: http://gerrit.cloudera.org:8080/13702
Reviewed-by: Alexey Serbin <as...@cloudera.com>
Reviewed-by: Todd Lipcon <to...@apache.org>
Tested-by: Kudu Jenkins
---
M src/kudu/rpc/connection.cc
M src/kudu/rpc/connection.h
M src/kudu/rpc/reactor.cc
M src/kudu/rpc/reactor.h
M src/kudu/rpc/rpc-test.cc
M src/kudu/util/net/socket.cc
M src/kudu/util/net/socket.h
7 files changed, 112 insertions(+), 2 deletions(-)

Approvals:
  Alexey Serbin: Looks good to me, but someone else must approve
  Todd Lipcon: Looks good to me, approved
  Kudu Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/13702
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
Gerrit-Change-Number: 13702
Gerrit-PatchSet: 5
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-2192: Enable TCP keepalive for all outbound connections

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/13702 )

Change subject: KUDU-2192: Enable TCP keepalive for all outbound connections
......................................................................


Patch Set 3:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/rpc/reactor.cc
File src/kudu/rpc/reactor.cc:

http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/rpc/reactor.cc@631
PS3, Line 631:     s = conn->SetTcpKeepAlive(FLAGS_tcp_keepalive_probe_period_s + rng_.Uniform32(4),
nit: consider using scope-only variable here, something like

auto s = ...;


http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/rpc/rpc-test.cc
File src/kudu/rpc/rpc-test.cc:

http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/rpc/rpc-test.cc@790
PS3, Line 790: TEST_P(TestRpc, TestTCPKeepalive) {
As Todd already mentioned, it would be great to have a scenario when connection is dropped due to non-acked keepalive probes.


http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/util/net/socket.h
File src/kudu/util/net/socket.h:

http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/util/net/socket.h@160
PS3, Line 160: MonoDelta
nit: it's a bit strange to see MonoDelta here, supposedly lower-level than rpc/connection.h, but see int as number of seconds in rpc/connnection.h

Maybe, unify that and use integer type everywhere?


http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/util/net/socket.cc
File src/kudu/util/net/socket.cc:

http://gerrit.cloudera.org:8080/#/c/13702/3/src/kudu/util/net/socket.cc@591
PS3, Line 591: const string
nit: static const char* const ?



-- 
To view, visit http://gerrit.cloudera.org:8080/13702
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
Gerrit-Change-Number: 13702
Gerrit-PatchSet: 3
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 21 Jun 2019 22:00:58 +0000
Gerrit-HasComments: Yes