You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Michael Ho (Code Review)" <ge...@cloudera.org> on 2017/06/02 18:21:43 UTC

[Impala-ASF-CR] IMPALA-5388: Don't retry RPC calls on TSSLException

Michael Ho has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7063

Change subject: IMPALA-5388: Don't retry RPC calls on TSSLException
......................................................................

IMPALA-5388: Don't retry RPC calls on TSSLException

Our RPC implementation would retry the RPC once if the first
invocation fails. The reasoning, as far as I understand, is
that the connection may have been closed so the retry logic
will reopen the connection and do the RPC call again. However,
with TLS enabled, the existing code will misinterpret TSSLException
as a lost connection and retry the RPC. TSSLException may be
thrown in both the send and recv RPC calls so it may occur
that the send may have suceeded already and resending the RPC
is not safe. As shown in IMPALA-5388, it can lead to wrong
query results.

This change fixes the problem by unconditionally returning
RPC_GENERAL_ERROR to the caller on TSSLException and don't
retry the RPC call.

Testing: Michael Brown will re-run the stress test in the secure
cluster. He's currently facing some deployment issue with CM and
cannot make much progress.

Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
---
M be/src/runtime/client-cache.h
1 file changed, 4 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/7063/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


IMPALA-5388: Only retry RPC on lost connection in send call

Previously, DoRpc() blacklists only a couple of conditions
which shouldn't retry the RPC on exception. This is fragile
as the errors could have happened after the payload has been
successfully sent to the destination. Such aggressive retry
behavior can lead to duplicated row batches being sent, causing
wrong results in queries.

This change fixes the problem by whitelisting the conditions
in which the RPC can be retried. Specifically, it pattern-matches
against certain errors in TSocket::write_partial() in the thrift
library and only retries the RPC in those cases. With SSL enabled,
we will never retry. We should investigate whether there are some
cases in which it's safe to retry.

This change also adds fault injection in the TransmitData() RPC
caller's path to emulate different exception cases.

Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Reviewed-on: http://gerrit.cloudera.org:8080/7063
Reviewed-by: Michael Ho <kw...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/common/global-flags.cc
M be/src/rpc/thrift-util.cc
M be/src/rpc/thrift-util.h
M be/src/runtime/backend-client.h
M be/src/runtime/client-cache.h
M be/src/testutil/CMakeLists.txt
A be/src/testutil/fault-injection-util.cc
M be/src/testutil/fault-injection-util.h
A tests/custom_cluster/test_rpc_exception.py
9 files changed, 300 insertions(+), 54 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Michael Ho: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7063/4/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

Line 184:   // String taken from TSocket::write_partial() in Thrift's TSocket.cpp
if we upgrade thrift, won't we need to recheck these? how will we remember to do that?


Line 188:        strstr(e.what(), "Socket send returned 0.") != nullptr);
how do we know this last one is due to a dropped connection?

also, for all of them, what happens if the receiver received some bytes before the connection was lost? how does restarting the rpc from the beginning work in that case?


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 9: Code-Review+2

+2 after addressing Henry's comments.

-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has uploaded a new patch set (#8).

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................

IMPALA-5388: Only retry RPC on lost connection in send call

Previously, DoRpc() blacklists only a couple of conditions
which shouldn't retry the RPC on exception. This is fragile
as the errors could have happened after the payload has been
successfully sent to the destination. Such aggressive retry
behavior can lead to duplicated row batches being sent, causing
wrong results in queries.

This change fixes the problem by whitelisting the conditions
in which the RPC can be retried. Specifically, it pattern-matches
against certain errors in TSocket::write_partial() in the thrift
library and only retries the RPC in those cases. With SSL enabled,
we will never retry. We should investigate whether there are some
cases in which it's safe to retry.

This change also adds fault injection in the TransmitData() RPC
caller's path to emulate different exception cases.

Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
---
M be/src/common/global-flags.cc
M be/src/rpc/thrift-util.cc
M be/src/rpc/thrift-util.h
M be/src/runtime/backend-client.h
M be/src/runtime/client-cache.h
M be/src/testutil/CMakeLists.txt
A be/src/testutil/fault-injection-util.cc
M be/src/testutil/fault-injection-util.h
A tests/custom_cluster/test_rpc_exception.py
9 files changed, 290 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/7063/8
-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7063/1/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS1, Line 259: catch (const apache::thrift::TException& e) {
             :       if (IsRecvTimeoutTException(e)) {
             :         return Status(TErrorCode::RPC_RECV_TIMEOUT, strings::Substitute(
             :             "Client $0 timed-out during recv call.", TNetworkAddressToString(address_)));
             :       }
             :       VLOG(1) << "client " << client_ << " unexpected exception: "
             :               << e.what() << ", type=" << typeid(e).name();
             : 
             :       // Client may have unexpectedly been closed, so re-open and retry.
             :       // TODO: ThriftClient should return proper error codes.
             :       const Status& status = Reopen();
             :       if (!status.ok()) {
             :         if (retry_is_safe != NULL) *retry_is_safe = true;
             :         return Status(TErrorCode::RPC_CLIENT_CONNECT_FAILURE, status.GetDetail());
             :       }
             :       try {
             :         (client_->*f)(*response, request);
             :       } catch (apache::thrift::TException& e) {
             :         // By this point the RPC really has failed.
             :         // TODO: Revisit this logic later. It's possible that the new connection
             :         // works but we hit timeout here.
             :         return Status(TErrorCode::RPC_GENERAL_ERROR, e.what());
> I agree. There is actually a TODO for it in the header comment. I didn't re
Implemented this idea in PS2.


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Don't retry RPC calls on TSSLException

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5388: Don't retry RPC calls on TSSLException
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7063/1/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS1, Line 259: catch (const apache::thrift::TException& e) {
             :       if (IsRecvTimeoutTException(e)) {
             :         return Status(TErrorCode::RPC_RECV_TIMEOUT, strings::Substitute(
             :             "Client $0 timed-out during recv call.", TNetworkAddressToString(address_)));
             :       }
             :       VLOG(1) << "client " << client_ << " unexpected exception: "
             :               << e.what() << ", type=" << typeid(e).name();
             : 
             :       // Client may have unexpectedly been closed, so re-open and retry.
             :       // TODO: ThriftClient should return proper error codes.
             :       const Status& status = Reopen();
             :       if (!status.ok()) {
             :         if (retry_is_safe != NULL) *retry_is_safe = true;
             :         return Status(TErrorCode::RPC_CLIENT_CONNECT_FAILURE, status.GetDetail());
             :       }
             :       try {
             :         (client_->*f)(*response, request);
             :       } catch (apache::thrift::TException& e) {
             :         // By this point the RPC really has failed.
             :         // TODO: Revisit this logic later. It's possible that the new connection
             :         // works but we hit timeout here.
             :         return Status(TErrorCode::RPC_GENERAL_ERROR, e.what());
> The more I stare at this, the more I think it's broken even without SSL. It
I agree. There is actually a TODO for it in the header comment. I didn't restructure the code in this change because that would be a more intrusive change and it's risky to restructure this heavily used path late into a release.


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 11: Code-Review+2

Fix the static_assert(). Verified it builds with clang now. Carry +2 forward.

-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 2:

I agree Tim. This in general is not very robust and some RPCs are not really idempotent. That said, this is a pre-existing problem in the code which I don't want to address this late into the release.

Stepping back a bit, I wonder what the historical motivation to add the retry logic in DoRPC() to begin with. My suspicion is that it's implementing a common pattern most callers would expect (i.e. retry the RPC on certain failures).

-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5388: Don't retry RPC calls on TSSLException

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-5388: Don't retry RPC calls on TSSLException
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7063/1/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS1, Line 258:  Status(TErrorCode::RPC_GENERAL_ERROR, e.what());
> Should we consider returning RPC_RECV_TIMEOUT instead if e.what() contains 
I don't think so - RPC_RECV_TIMEOUT is only used elsewhere to properly print an error message.


PS1, Line 259: catch (const apache::thrift::TException& e) {
             :       if (IsRecvTimeoutTException(e)) {
             :         return Status(TErrorCode::RPC_RECV_TIMEOUT, strings::Substitute(
             :             "Client $0 timed-out during recv call.", TNetworkAddressToString(address_)));
             :       }
             :       VLOG(1) << "client " << client_ << " unexpected exception: "
             :               << e.what() << ", type=" << typeid(e).name();
             : 
             :       // Client may have unexpectedly been closed, so re-open and retry.
             :       // TODO: ThriftClient should return proper error codes.
             :       const Status& status = Reopen();
             :       if (!status.ok()) {
             :         if (retry_is_safe != NULL) *retry_is_safe = true;
             :         return Status(TErrorCode::RPC_CLIENT_CONNECT_FAILURE, status.GetDetail());
             :       }
             :       try {
             :         (client_->*f)(*response, request);
             :       } catch (apache::thrift::TException& e) {
             :         // By this point the RPC really has failed.
             :         // TODO: Revisit this logic later. It's possible that the new connection
             :         // works but we hit timeout here.
             :         return Status(TErrorCode::RPC_GENERAL_ERROR, e.what());
The more I stare at this, the more I think it's broken even without SSL. It's pretty clear to see that TExceptions can be thrown by TSocket on its read() path which would lead to a spurious retry in any case. It looks like TSocket gives a narrow set of error codes for the 'socket not open / conn reset' error cases that would be better used here.

In fact I just tried this, and by throwing a TException between writing and reading in one TransmitData() RPC, I can get wrong results pretty easily.

I think we need to restructure this block to narrow the retried RPCs only to those a) on the write path and b) that have the error code NOT_OPEN.


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 2:

*I don't necessarily mean we need to fix it in this patchset, but it seems like this is still a possible source of hard-to-diagnose bugs, so we should definitely prioritise doing it as a follow-on.

-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7063/2/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS2, Line 262: Status(TErrorCode::RPC_GENERAL_ERROR, e.what());
Please note that this will start returning "RPC: SSL resource temporarily unavailable" error and possibly some other errors when the cluster is under stress. Previously, we would transparently retry the RPC (which isn't the right thing to do all the time). I would like to point this change of behavior out to reviewers to make sure that the risk is acceptable and it won't be causing escalations.


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 2:

(4 comments)

Were you able to test this at all?

The idea behind retrying in DoRpc() was to insulate callers from having to detect and implement re-open / retry loops themselves. Because the client-cache keeps connections around for a long time, they would sometimes get closed by the remote end. That expanded to include the logic around retrying the recv part of an RPC if the socket read timeout had expired, since TransmitData() can block for a long time before responding. Arguably it might be better to set the timeouts for the individual RPC (that's what happens in KRPC). 

Although I think this is brittle, I feel like it's relatively correct _if_ our assumption that the socket-closed exceptions we detect only get thrown on the write path is accurate. It seems to be the case for non-secure clusters; for Kerberos or TLS it's less clear (but it seems that receiving the response only involves read operations). 

Can you see a situation with this patch where an RPC will be spuriously retried?

http://gerrit.cloudera.org:8080/#/c/7063/2/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS2, Line 252: else if 
nit: avoid 'else' since previous block will return. Here and below.


PS2, Line 253: <F, Request, Response>
can the compiler not figure out the specialization from the arguments?


PS2, Line 262: Status(TErrorCode::RPC_GENERAL_ERROR, e.what());
> Please note that this will start returning "RPC: SSL resource temporarily u
I think it's ok - it's actually hard to think of many cases where retrying an RPC was a good idea (cancellation, maybe profile reporting...). I'd rather get error messages than incorrect behaviour.


PS2, Line 285: // If it's not timeout exception, then the connection is broken, stop retrying.
Remove?


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has uploaded a new patch set (#4).

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................

IMPALA-5388: Only retry RPC on lost connection in send call

Previously, DoRpc() blacklists only a couple of conditions
which shouldn't retry the RPC on exception. This is fragile
as the errors could have happened after the payload has been
successfully sent to the destination. Such aggressive retry
behavior can lead to duplicated row batches being sent, causing
wrong results in queries.

This change fixes the problem by whitelisting the conditions
in which the RPC can be retried. Specifically, it pattern-matches
against certain errors in TSocket::write_partial() in the thrift
library and only retries the RPC in those cases.

This change also adds fault injection in the TransmitData() RPC
caller's path to emulate different exception cases.

Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
---
M be/src/common/global-flags.cc
M be/src/rpc/thrift-util.cc
M be/src/rpc/thrift-util.h
M be/src/runtime/backend-client.h
M be/src/runtime/client-cache.h
M be/src/testutil/CMakeLists.txt
A be/src/testutil/fault-injection-util.cc
M be/src/testutil/fault-injection-util.h
A tests/custom_cluster/test_rpc_exception.py
9 files changed, 275 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/7063/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7063/7/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

Line 194:   DCHECK_EQ("0.9.0", PACKAGE_VERSION) << Substitute("Please update $0() to match "
> and we discussed, let's also handle the TTransportException::TIMED_OUT, "se
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 6:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/7063/4/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

PS4, Line 183:  strstr(e.what(), "EAGA
> should we also retry for "send time out"? I mean the one throw by write().
See comments below.


Line 184: }
> if we upgrade thrift, won't we need to recheck these? how will we remember 
Added a DCHECK for the thrift version string.


Line 188:       "to match the current version of TSocket.cpp";
> That would also be accurate. I think both descriptions are correct.
In that sense, we should also retry on the exception below in write() as Juan mentioned above since we know we haven't completely finished the write in this case:

  while (sent < len) {
    uint32_t b = write_partial(buf + sent, len - sent);
    if (b == 0) {
      // This should only happen if the timeout set with SO_SNDTIMEO expired.
      // Raise an exception.
      throw TTransportException(TTransportException::TIMED_OUT,
                                "send timeout expired");
    }
    sent += b;
  }


http://gerrit.cloudera.org:8080/#/c/7063/4/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS4, Line 298: return strings::Substitute("Client $0 hits unexpected exception: $1 type= $2",
             :         client_, e.what(), typeid(e).name());
             :   }
             : 
> nit: prefer Substitute() over stringstream
Done


http://gerrit.cloudera.org:8080/#/c/7063/4/be/src/testutil/fault-injection-util.cc
File be/src/testutil/fault-injection-util.cc:

PS4, Line 51: if 
> remove std::
Actually needed.


PS4, Line 67:       throw TTransportException(TTransportExcept
> maybe DCHECK? Otherwise could be confusing if this silently fails to work a
Done


Line 77:                                   "Called read on non-open socket");
> Isn't this branch supposed to be one level higher - when !is_send?
Oops...Fixed.


http://gerrit.cloudera.org:8080/#/c/7063/4/be/src/testutil/fault-injection-util.h
File be/src/testutil/fault-injection-util.h:

PS4, Line 48: inject
> grammar ("injects delays" maybe?)
Done


PS4, Line 55: inject
> injects an exception in a specified
Done


http://gerrit.cloudera.org:8080/#/c/7063/4/tests/custom_cluster/test_rpc_exception.py
File tests/custom_cluster/test_rpc_exception.py:

PS4, Line 27: \
> not needed
It's actually needed in python for line continuation. Did I misunderstand your comment ?


PS4, Line 33: e
> indent python two spaces, not four
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7063/4//COMMIT_MSG
Commit Message:

PS4, Line 18: TSocket::write_partial()
where can I find that code these days?


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 8:

(1 comment)

I looked at the TSaslTransport layer since it seemed like we didn't vet that along with TSocket and TSSLSocket, there are a couple of places where exceptions can be thrown (during reads and writes, not during SASL negotiation), but if we're going to treat them the same as TSSLExceptions, then we can leave the code as it is, i.e. just return a RPC_GENERAL_ERROR. Just wanted to add this as a note here.

http://gerrit.cloudera.org:8080/#/c/7063/8/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

Line 316:     try {
> We already have a log statement in Reopen(). If Reopen() fails, we should l
Good point. I think that should be fine since we only call Reopen() here and nowhere else.


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7063/2//COMMIT_MSG
Commit Message:

PS2, Line 13: behvaior
typo


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Hello Juan Yu,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7063

to look at the new patch set (#9).

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................

IMPALA-5388: Only retry RPC on lost connection in send call

Previously, DoRpc() blacklists only a couple of conditions
which shouldn't retry the RPC on exception. This is fragile
as the errors could have happened after the payload has been
successfully sent to the destination. Such aggressive retry
behavior can lead to duplicated row batches being sent, causing
wrong results in queries.

This change fixes the problem by whitelisting the conditions
in which the RPC can be retried. Specifically, it pattern-matches
against certain errors in TSocket::write_partial() in the thrift
library and only retries the RPC in those cases. With SSL enabled,
we will never retry. We should investigate whether there are some
cases in which it's safe to retry.

This change also adds fault injection in the TransmitData() RPC
caller's path to emulate different exception cases.

Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
---
M be/src/common/global-flags.cc
M be/src/rpc/thrift-util.cc
M be/src/rpc/thrift-util.h
M be/src/runtime/backend-client.h
M be/src/runtime/client-cache.h
M be/src/testutil/CMakeLists.txt
A be/src/testutil/fault-injection-util.cc
M be/src/testutil/fault-injection-util.h
A tests/custom_cluster/test_rpc_exception.py
9 files changed, 291 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/7063/9
-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 9: Code-Review+1

LGTM as well.

-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7063/4/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

PS4, Line 183: IsSendCnxLostTException
> See comments below.
Thanks for the comment. PS8 will now match against "send time out" too


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Juan Yu (Code Review)" <ge...@cloudera.org>.
Juan Yu has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7063/4/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

PS4, Line 183: IsSendCnxLostTException
should we also retry for "send time out"? I mean the one throw by write().
In theory, since write_partial() is called in a while loop by write(), any of those error could happen after the first write_partial() so some bytes may reach receiver side already. but that's existing issue as discussed in previous comments, no need to be addressed in this fix.


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has uploaded a new patch set (#6).

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................

IMPALA-5388: Only retry RPC on lost connection in send call

Previously, DoRpc() blacklists only a couple of conditions
which shouldn't retry the RPC on exception. This is fragile
as the errors could have happened after the payload has been
successfully sent to the destination. Such aggressive retry
behavior can lead to duplicated row batches being sent, causing
wrong results in queries.

This change fixes the problem by whitelisting the conditions
in which the RPC can be retried. Specifically, it pattern-matches
against certain errors in TSocket::write_partial() in the thrift
library and only retries the RPC in those cases.

This change also adds fault injection in the TransmitData() RPC
caller's path to emulate different exception cases.

Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
---
M be/src/common/global-flags.cc
M be/src/rpc/thrift-util.cc
M be/src/rpc/thrift-util.h
M be/src/runtime/backend-client.h
M be/src/runtime/client-cache.h
M be/src/testutil/CMakeLists.txt
A be/src/testutil/fault-injection-util.cc
M be/src/testutil/fault-injection-util.h
A tests/custom_cluster/test_rpc_exception.py
9 files changed, 274 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/7063/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 3:

Thanks. What about my other question: can you see a situation where, even with this patch, there could be a spurious retry?

-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has uploaded a new patch set (#3).

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................

IMPALA-5388: Only retry RPC on lost connection in send call

Previously, DoRpc() blacklists only a couple of conditions
which shouldn't retry the RPC on exception. This is fragile
as the errors could have happened after the payload has been
successfully sent to the destination. Such aggressive retry
behavior can lead to duplicated row batches being sent, causing
wrong results in queries.

This change fixes the problem by whitelisting the conditions
in which the RPC can be retried. Specifically, it pattern-matches
against certain errors in TSocket::write_partial() in the thrift
library and only retries the RPC in those cases.

Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
---
M be/src/rpc/thrift-util.cc
M be/src/rpc/thrift-util.h
M be/src/runtime/client-cache.h
3 files changed, 58 insertions(+), 32 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/7063/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has uploaded a new patch set (#2).

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................

IMPALA-5388: Only retry RPC on lost connection in send call

Previously, DoRpc() blacklists only a couple of conditions
which shouldn't retry the RPC on exception. This is fragile
as the errors could have happened after the payload has been
successfully sent to the destination. Such aggressive retry
behvaior can lead to duplicated row batches being sent, causing
wrong results in queries.

This change fixes the problem by whitelisting the conditions
in which the RPC can be retried. Specifically, it pattern-matches
against certain errors in TSocket::write_partial() in the thrift
library and only retries the RPC in those cases.

Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
---
M be/src/rpc/thrift-util.cc
M be/src/rpc/thrift-util.h
M be/src/runtime/client-cache.h
3 files changed, 60 insertions(+), 33 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/7063/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, Juan Yu, Henry Robinson, Sailesh Mukil, Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7063

to look at the new patch set (#11).

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................

IMPALA-5388: Only retry RPC on lost connection in send call

Previously, DoRpc() blacklists only a couple of conditions
which shouldn't retry the RPC on exception. This is fragile
as the errors could have happened after the payload has been
successfully sent to the destination. Such aggressive retry
behavior can lead to duplicated row batches being sent, causing
wrong results in queries.

This change fixes the problem by whitelisting the conditions
in which the RPC can be retried. Specifically, it pattern-matches
against certain errors in TSocket::write_partial() in the thrift
library and only retries the RPC in those cases. With SSL enabled,
we will never retry. We should investigate whether there are some
cases in which it's safe to retry.

This change also adds fault injection in the TransmitData() RPC
caller's path to emulate different exception cases.

Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
---
M be/src/common/global-flags.cc
M be/src/rpc/thrift-util.cc
M be/src/rpc/thrift-util.h
M be/src/runtime/backend-client.h
M be/src/runtime/client-cache.h
M be/src/testutil/CMakeLists.txt
A be/src/testutil/fault-injection-util.cc
M be/src/testutil/fault-injection-util.h
A tests/custom_cluster/test_rpc_exception.py
9 files changed, 300 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/7063/11
-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 10:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7063/9/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

PS9, Line 194:  possible
> can you make this a static assert? Otherwise we might not hit this as early
Done


http://gerrit.cloudera.org:8080/#/c/7063/9/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS9, Line 297: std::s
> this should be std::string (I think some google header rudely uses std :/).
Done


PS9, Line 298:  $0 
> nit: prefer "hit an" or "saw an"
Done


PS9, Line 299: TNetwor
> TNetworkAddressToString(address_) (match the other exception messages, not 
Done


http://gerrit.cloudera.org:8080/#/c/7063/9/be/src/testutil/fault-injection-util.cc
File be/src/testutil/fault-injection-util.cc:

PS9, Line 69: // by IsSendFailTException().
> it's unlikely in the extreme, but you could never take this branch if 1024 
Done


http://gerrit.cloudera.org:8080/#/c/7063/9/tests/custom_cluster/test_rpc_exception.py
File tests/custom_cluster/test_rpc_exception.py:

PS9, Line 27: # This que
> can you comment to say that proper test coverage relies on sending Transmit
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7063/4/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

Line 188:        strstr(e.what(), "Socket send returned 0.") != nullptr);
> Reopen() should cycle the TCP connection underneath: Thrift on the server s
Then aren't we really saying that these exceptions guarantee that the send wasn't complete (and that we'll reopen the connection in response), rather than that the connections was already lost? i.e. is this really IsSendIncompleteTException()?


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 3:

Oh right. Sorry for missing that. The concern I have is what happens if write path some how lost the connection after sending half of the RPC argument. I suppose ThriftRPC do provide some level of atomicity (i.e. all or nothing) but it'd be good to have someone more familiar with it confirm this assumption.

-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 3:

Right - my reading of the Thrift code is that if there's a partial write followed by a reset, an exception will be thrown before any processing is done since the RPCs are processed synchronously (i.e. the full payload is read before getting handed off the RPC handler).

-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7063/4/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

Line 188:        strstr(e.what(), "Socket send returned 0.") != nullptr);
> Then aren't we really saying that these exceptions guarantee that the send 
That would also be accurate. I think both descriptions are correct.


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 9: Code-Review+1

(6 comments)

Thanks - I think this is a great improvement.

http://gerrit.cloudera.org:8080/#/c/7063/9/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

PS9, Line 194: DCHECK_EQ
can you make this a static assert? Otherwise we might not hit this as early (e.g. until we run tests).


http://gerrit.cloudera.org:8080/#/c/7063/9/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS9, Line 297: string
this should be std::string (I think some google header rudely uses std :/).


PS9, Line 298: hits
nit: prefer "hit an" or "saw an"


PS9, Line 299: client_
TNetworkAddressToString(address_) (match the other exception messages, not the log message that was removed).


http://gerrit.cloudera.org:8080/#/c/7063/9/be/src/testutil/fault-injection-util.cc
File be/src/testutil/fault-injection-util.cc:

PS9, Line 69: if ((send_count.Load() / 1024) % 2 == 0) {
it's unlikely in the extreme, but you could never take this branch if 1024 sends happen between lines 65 and 69. Maybe just cache the return from send_count.Add(1) and use it here?


http://gerrit.cloudera.org:8080/#/c/7063/9/tests/custom_cluster/test_rpc_exception.py
File tests/custom_cluster/test_rpc_exception.py:

PS9, Line 27: TEST_QUERY
can you comment to say that proper test coverage relies on sending TransmitData() > 1024 times?


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 2:

(3 comments)

Thanks for the reviews. Michael Brown ran the stress test in the secure cluster over the weekend with this patch and he didn't find any wrong results so far. I also requested a stress test run with the non-secure cluster.

For testing, I looked into the fault injection logic but didn't look into it in details.

http://gerrit.cloudera.org:8080/#/c/7063/2/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS2, Line 252: else if 
> nit: avoid 'else' since previous block will return. Here and below.
Done


PS2, Line 253: <F, Request, Response>
> can the compiler not figure out the specialization from the arguments?
Done


PS2, Line 285: // If it's not timeout exception, then the connection is broken, stop retrying.
> Remove?
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 3:

(2 comments)

Will update the patch with new tests.

http://gerrit.cloudera.org:8080/#/c/7063/3/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

PS3, Line 186: write() send()
> As Henry mentioned, the retry is to handle transient failure. There are thr
As discussed offline, this is matching three specific exception cases in write_partial(). EAGAIN is translated to a timeout which may not be safe for retry.


http://gerrit.cloudera.org:8080/#/c/7063/3/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS3, Line 256: VLOG(1)
> I think this trace and below one are redundant, the Status will log error, 
Yes, I was just preserving the pre-existing error logging we have in the code.

We can assemble a stringstream instead and pass it to Status constructor.


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 8:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7063/8/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

Line 310:     // TODO: ThriftClient should return proper error codes.
> DCHECK(client_is_unrecoverable_);
Done


Line 316:     try {
> nit: Should we add a higher level LOG here stating the retry?
We already have a log statement in Reopen(). If Reopen() fails, we should log in line 314 so I guess it should be okay to skip the log statement here. Are you concerned about certain cases in which the log statement will help us with debugging ?


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Juan Yu (Code Review)" <ge...@cloudera.org>.
Juan Yu has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 8: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 4:

Added some EE tests for the exception cases in PS4.

-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7063/7/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

Line 194:        strstr(e.what(), "Socket send returned 0.") != nullptr);
and we discussed, let's also handle the TTransportException::TIMED_OUT, "send timeout expired"
case, and rename this function to IsSendInCompleteTException() or similar.


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 10:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/700/

-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 11: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 10: Verified-1

Build failed: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/700/

-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7063/4/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

Line 188:        strstr(e.what(), "Socket send returned 0.") != nullptr);
> how do we know this last one is due to a dropped connection?
Reopen() should cycle the TCP connection underneath: Thrift on the server should then abandon the first recv(), accept a new connection and handle the repeated RPC in a new context.


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 9: Code-Review+1

LGTM but would be good to have Henry and Sailesh sign off they have no more comments.

-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 9: Code-Review+1

Carry +1 forward.

-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has uploaded a new patch set (#7).

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................

IMPALA-5388: Only retry RPC on lost connection in send call

Previously, DoRpc() blacklists only a couple of conditions
which shouldn't retry the RPC on exception. This is fragile
as the errors could have happened after the payload has been
successfully sent to the destination. Such aggressive retry
behavior can lead to duplicated row batches being sent, causing
wrong results in queries.

This change fixes the problem by whitelisting the conditions
in which the RPC can be retried. Specifically, it pattern-matches
against certain errors in TSocket::write_partial() in the thrift
library and only retries the RPC in those cases. With SSL enabled,
we will never retry. We should investigate whether there are some
cases in which it's safe to retry.

This change also adds fault injection in the TransmitData() RPC
caller's path to emulate different exception cases.

Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
---
M be/src/common/global-flags.cc
M be/src/rpc/thrift-util.cc
M be/src/rpc/thrift-util.h
M be/src/runtime/backend-client.h
M be/src/runtime/client-cache.h
M be/src/testutil/CMakeLists.txt
A be/src/testutil/fault-injection-util.cc
M be/src/testutil/fault-injection-util.h
A tests/custom_cluster/test_rpc_exception.py
9 files changed, 275 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/7063/7
-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 4:

(8 comments)

Thanks for the tests, I think this looks a lot better.

http://gerrit.cloudera.org:8080/#/c/7063/4/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS4, Line 298: std::stringstream ss;
             :     ss << "client " << client_ << " unexpected exception: "
             :        << e.what() << ", type=" << typeid(e).name();
             :     return ss.str();
nit: prefer Substitute() over stringstream


http://gerrit.cloudera.org:8080/#/c/7063/4/be/src/testutil/fault-injection-util.cc
File be/src/testutil/fault-injection-util.cc:

PS4, Line 51: std
remove std::


PS4, Line 67: if (target_rpc_type != RPC_TRANSMITDATA) return;
maybe DCHECK? Otherwise could be confusing if this silently fails to work as expected.


Line 77:     } else {
Isn't this branch supposed to be one level higher - when !is_send?


http://gerrit.cloudera.org:8080/#/c/7063/4/be/src/testutil/fault-injection-util.h
File be/src/testutil/fault-injection-util.h:

PS4, Line 48: inject
grammar ("injects delays" maybe?)


PS4, Line 55: inject
injects an exception in a specified


http://gerrit.cloudera.org:8080/#/c/7063/4/tests/custom_cluster/test_rpc_exception.py
File tests/custom_cluster/test_rpc_exception.py:

PS4, Line 27: \
not needed


PS4, Line 33:  
indent python two spaces, not four


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Juan Yu (Code Review)" <ge...@cloudera.org>.
Juan Yu has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7063/3/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

PS3, Line 186: write() send()
As Henry mentioned, the retry is to handle transient failure. There are three failure cases with "NOT_OPEN" why only retry on this one but not other two?
e.g. Isn't "THRIFT_EAGAIN" means safe to retry?


http://gerrit.cloudera.org:8080/#/c/7063/3/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS3, Line 256: VLOG(1)
I think this trace and below one are redundant, the Status will log error, won't it?


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Hello Juan Yu, Henry Robinson, Sailesh Mukil, Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7063

to look at the new patch set (#10).

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................

IMPALA-5388: Only retry RPC on lost connection in send call

Previously, DoRpc() blacklists only a couple of conditions
which shouldn't retry the RPC on exception. This is fragile
as the errors could have happened after the payload has been
successfully sent to the destination. Such aggressive retry
behavior can lead to duplicated row batches being sent, causing
wrong results in queries.

This change fixes the problem by whitelisting the conditions
in which the RPC can be retried. Specifically, it pattern-matches
against certain errors in TSocket::write_partial() in the thrift
library and only retries the RPC in those cases. With SSL enabled,
we will never retry. We should investigate whether there are some
cases in which it's safe to retry.

This change also adds fault injection in the TransmitData() RPC
caller's path to emulate different exception cases.

Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
---
M be/src/common/global-flags.cc
M be/src/rpc/thrift-util.cc
M be/src/rpc/thrift-util.h
M be/src/runtime/backend-client.h
M be/src/runtime/client-cache.h
M be/src/testutil/CMakeLists.txt
A be/src/testutil/fault-injection-util.cc
M be/src/testutil/fault-injection-util.h
A tests/custom_cluster/test_rpc_exception.py
9 files changed, 297 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/7063/10
-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 11:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/702/

-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 10: Code-Review+2

Thanks for all the reviews. Carry +2 forward.

-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5388: Don't retry RPC calls on TSSLException

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5388: Don't retry RPC calls on TSSLException
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7063/1/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS1, Line 258:  Status(TErrorCode::RPC_GENERAL_ERROR, e.what());
Should we consider returning RPC_RECV_TIMEOUT instead if e.what() contains "ssl_read" ?


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 2:

It seems worthwhile to avoid spurious retries for RPCs in general, but it seems like this doesn't robustly solve the problem of sending duplicate row batches - in general I don't think it's possible to determine whether the RPC was delivered or not.

We could avoid duplicate batches entirely if we maintained a sequence number for each sender/receiver pair and included it in each batch. Then the receiver could detect and discard duplicate batches.

-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5388: Only retry RPC on lost connection in send call

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-5388: Only retry RPC on lost connection in send call
......................................................................


Patch Set 8:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7063/8/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

Line 310:     // TODO: ThriftClient should return proper error codes.
DCHECK(client_is_unrecoverable_);


Line 316:     try {
nit: Should we add a higher level LOG here stating the retry?


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Alan Choi <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes