You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Michael Ho (Code Review)" <ge...@cloudera.org> on 2017/06/20 06:58:55 UTC

[Impala-ASF-CR] IMPALA-5537: Retry RpcRecv on recv timeout exception for SSL connection

Michael Ho has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7229

Change subject: IMPALA-5537: Retry RpcRecv on recv timeout exception for SSL connection
......................................................................

IMPALA-5537: Retry RpcRecv on recv timeout exception for SSL connection

After the fix for IMPALA-5388, all TSSLException thrown will be
treated as fatal error and the query will fail. Turns out that
this is too strict and in a secure cluster under load, queries
can easily hit timeout waiting for RPC response.

When running without SSL, we call RetryRpcRecv() to retry the recv
part of an RPC if recv() called by the TSocket underlying the RPC
returns an EAGAIN. This change extends that logic to cover secure
connection. In particular, we pattern match against the exception
string "SSL_read: Resource temporarily unavailable" which corresponds
to EAGAIN error code being thrown in the SSL_read() path. The fault
injection utility has also been updated to distinguish between time
out and lost connection to exercise different error handling paths
in the send and recv paths.

After this change, we will still fail the RPC if there is
any error in the send part of a RPC with secure connection.

Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
---
M be/src/rpc/thrift-util.cc
M be/src/testutil/fault-injection-util.cc
M be/src/testutil/fault-injection-util.h
M tests/custom_cluster/test_rpc_exception.py
4 files changed, 63 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/29/7229/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................


Patch Set 5:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7229/5/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

PS5, Line 198: This is not very robust as it's
             : // possible that the receiver of the RPC call may have received all the RPC payload
             : // but the ACK to the sender may have been dropped somehow. In which case, it's
             : // not safe to retry the RPC if it's not idempotent.
> maybe reword this to say the caller should first check if the send was comp
Done


PS5, Line 202: end-to-end tracking of RPC calls to detect duplicated calls in the receiver side
> what does that mean? is it still applicable?
Removed.


http://gerrit.cloudera.org:8080/#/c/7229/5/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

Line 330:     client_is_unrecoverable_ = false;
> i guess this restores the behavior before the patch that introduced this Re
Yes. It was missing in the last patch. It resulted in sub-optimal behavior as good connection may be closed.


-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................


Patch Set 5: Code-Review+2

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7229/5/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

PS5, Line 198: This is not very robust as it's
             : // possible that the receiver of the RPC call may have received all the RPC payload
             : // but the ACK to the sender may have been dropped somehow. In which case, it's
             : // not safe to retry the RPC if it's not idempotent.
maybe reword this to say the caller should first check if the send was completed, or something? now that we do that.


PS5, Line 202: end-to-end tracking of RPC calls to detect duplicated calls in the receiver side
what does that mean? is it still applicable?


http://gerrit.cloudera.org:8080/#/c/7229/5/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

Line 330:     client_is_unrecoverable_ = false;
i guess this restores the behavior before the patch that introduced this RetryRpc() subroutine? okay.


-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Hello Henry Robinson, Sailesh Mukil, Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7229

to look at the new patch set (#6).

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................

IMPALA-5537: Retry RPC on somes exceptions with SSL connection

After the fix for IMPALA-5388, all TSSLException thrown will be
treated as fatal error and the query will fail. Turns out that
this is too strict and in a secure cluster under load, queries
can easily hit timeout waiting for RPC response.

When running without SSL, we call RetryRpcRecv() to retry the recv
part of an RPC if the TSocket underlying the RPC gets an EAGAIN
during recv(). This change extends that logic to cover secure
connection. In particular, we pattern match against the exception
string "SSL_read: Resource temporarily unavailable" which corresponds
to EAGAIN error code being thrown in the SSL_read() path.

Similarly, we will handle closed connection in send() path with
secure connection by pattern matching against the exception string
"TTransportException: Transport not open". To verify that the exception
is thrown during the send part of a RPC call, the RPC client interface
has been augmented to take a bool* argument which is set to true after
the send part of the RPC has completed but before the recv part starts.
If DoRPC() catches an exception and the send part isn't done yet, the
entire RPC if the exception string matches certain substrings which are
safe to retry.

The fault injection utility has also been updated to distinguish between
time out and lost connection to exercise different error handling paths
in the send and recv paths.

Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
---
A be/src/catalog/catalog-service-client-wrapper.h
M be/src/exec/catalog-op-executor.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-util.cc
M be/src/runtime/backend-client.h
M be/src/runtime/client-cache-types.h
M be/src/runtime/client-cache.h
M be/src/service/client-request-state.cc
A be/src/statestore/statestore-service-client-wrapper.h
A be/src/statestore/statestore-subscriber-client-wrapper.h
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/testutil/fault-injection-util.cc
M be/src/testutil/fault-injection-util.h
M tests/custom_cluster/test_rpc_exception.py
17 files changed, 397 insertions(+), 92 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/29/7229/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................


Patch Set 7:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/769/

-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................


IMPALA-5537: Retry RPC on somes exceptions with SSL connection

After the fix for IMPALA-5388, all TSSLException thrown will be
treated as fatal error and the query will fail. Turns out that
this is too strict and in a secure cluster under load, queries
can easily hit timeout waiting for RPC response.

When running without SSL, we call RetryRpcRecv() to retry the recv
part of an RPC if the TSocket underlying the RPC gets an EAGAIN
during recv(). This change extends that logic to cover secure
connection. In particular, we pattern match against the exception
string "SSL_read: Resource temporarily unavailable" which corresponds
to EAGAIN error code being thrown in the SSL_read() path.

Similarly, we will handle closed connection in send() path with
secure connection by pattern matching against the exception string
"TTransportException: Transport not open". To verify that the exception
is thrown during the send part of a RPC call, the RPC client interface
has been augmented to take a bool* argument which is set to true after
the send part of the RPC has completed but before the recv part starts.
If DoRPC() catches an exception and the send part isn't done yet, the
entire RPC if the exception string matches certain substrings which are
safe to retry.

The fault injection utility has also been updated to distinguish between
time out and lost connection to exercise different error handling paths
in the send and recv paths.

Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Reviewed-on: http://gerrit.cloudera.org:8080/7229
Reviewed-by: Dan Hecht <dh...@cloudera.com>
Tested-by: Impala Public Jenkins
---
A be/src/catalog/catalog-service-client-wrapper.h
M be/src/exec/catalog-op-executor.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-util.cc
M be/src/runtime/backend-client.h
M be/src/runtime/client-cache-types.h
M be/src/runtime/client-cache.h
M be/src/service/client-request-state.cc
A be/src/statestore/statestore-service-client-wrapper.h
A be/src/statestore/statestore-subscriber-client-wrapper.h
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/testutil/fault-injection-util.cc
M be/src/testutil/fault-injection-util.h
M tests/custom_cluster/test_rpc_exception.py
17 files changed, 425 insertions(+), 92 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Dan Hecht: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-5537: Retry RpcRecv on recv timeout exception for SSL connection

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5537: Retry RpcRecv on recv timeout exception for SSL connection
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7229/1/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

PS1, Line 209: Called write on non-open socket"
> I agree, that would be safer - then we can be certain what phase the except
This comes with some complication as we need to update other RPC clients too. Should be doable.


-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................


Patch Set 5: Code-Review+1

The code LGTM. Pending task is to only check if the secure stress test passes.

-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................


Patch Set 7: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5537: Retry RpcRecv on recv timeout exception for SSL connection

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5537: Retry RpcRecv on recv timeout exception for SSL connection
......................................................................


Patch Set 1:

(1 comment)

Henry, can you please see if the suggestion in thrift-util.cc makes sense ?

http://gerrit.cloudera.org:8080/#/c/7229/1/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

PS1, Line 209: Called write on non-open socket"
It seems that we still didn't handle the case of send failure for failed connection. Unfortunately, TSSLSocket.cpp returns a generic TTransportException::NOT_OPEN so it's hard to distinguish whether the exception is from SSL_write() or SSL_read().

A better solution would be to override all high level functions in ImpalaInternalServiceClient (similar to what we did for TransmitData()) and adds a flag to indicate the send part of an RPC is done. In this way, we can safely retry the entire RPC if the send part hasn't completed yet.


-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................


Patch Set 2:

PS2 augments the RPC client interfaces to take an extra bool* argument to indicate whether send has completed or not. Useful for determining whether it's safe to retry the entire RPC.

-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7229/2/be/src/catalog/catalog-service-client-wrapper.h
File be/src/catalog/catalog-service-client-wrapper.h:

Line 25: class CatalogServiceClientWrapper: public CatalogServiceClient {
> nit: space before :
Done


http://gerrit.cloudera.org:8080/#/c/7229/2/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

PS2, Line 207: ||
> Should this really be && ?
Oops. Must have been messed up in some previous changes which I reverted.


http://gerrit.cloudera.org:8080/#/c/7229/2/be/src/runtime/backend-client.h
File be/src/runtime/backend-client.h:

Line 113: 
> nit: extra blank line.
Done


http://gerrit.cloudera.org:8080/#/c/7229/2/be/src/runtime/client-cache-types.h
File be/src/runtime/client-cache-types.h:

PS2, Line 23: CatalogServiceClientWrapper
> replace line 38 with this one.
Done


http://gerrit.cloudera.org:8080/#/c/7229/2/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS2, Line 243: bool* retry_is_safe = NULL
> let's not do it now, but I think we should get rid of this parameter in fav
That's feasible but it would require passing in RuntimeState too to check for cancellation.


PS2, Line 319: bool send_done = false;
> nit: move inside try block
Done


http://gerrit.cloudera.org:8080/#/c/7229/2/be/src/statestore/statestore-service-client-wrapper.h
File be/src/statestore/statestore-service-client-wrapper.h:

PS2, Line 28:  
> space really needed? here and below
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has uploaded a new patch set (#2).

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................

IMPALA-5537: Retry RPC on somes exceptions with SSL connection

After the fix for IMPALA-5388, all TSSLException thrown will be
treated as fatal error and the query will fail. Turns out that
this is too strict and in a secure cluster under load, queries
can easily hit timeout waiting for RPC response.

When running without SSL, we call RetryRpcRecv() to retry the recv
part of an RPC if the TSocket underlying the RPC gets an EAGAIN
during recv(). This change extends that logic to cover secure
connection. In particular, we pattern match against the exception
string "SSL_read: Resource temporarily unavailable" which corresponds
to EAGAIN error code being thrown in the SSL_read() path.

Similarly, we will handle closed connection in send() path with
secure connection by pattern matching against the exception string
"TTransportException: Transport not open". To verify that the exception
is thrown during the send part of a RPC call, the RPC client interface
has been augmented to take a bool* argument which is set to true after
the send part of the RPC has completed but before the recv part starts.
If DoRPC() catches an exception and the send part isn't done yet, the
entire RPC if the exception string matches certain substrings which are
safe to retry.

The fault injection utility has also been updated to distinguish between
time out and lost connection to exercise different error handling paths
in the send and recv paths.

Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
---
A be/src/catalog/catalog-service-client-wrapper.h
M be/src/exec/catalog-op-executor.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-util.cc
M be/src/runtime/backend-client.h
M be/src/runtime/client-cache-types.h
M be/src/runtime/client-cache.h
M be/src/service/client-request-state.cc
A be/src/statestore/statestore-service-client-wrapper.h
A be/src/statestore/statestore-subscriber-client-wrapper.h
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/testutil/fault-injection-util.cc
M be/src/testutil/fault-injection-util.h
M tests/custom_cluster/test_rpc_exception.py
17 files changed, 375 insertions(+), 74 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/29/7229/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has uploaded a new patch set (#5).

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................

IMPALA-5537: Retry RPC on somes exceptions with SSL connection

After the fix for IMPALA-5388, all TSSLException thrown will be
treated as fatal error and the query will fail. Turns out that
this is too strict and in a secure cluster under load, queries
can easily hit timeout waiting for RPC response.

When running without SSL, we call RetryRpcRecv() to retry the recv
part of an RPC if the TSocket underlying the RPC gets an EAGAIN
during recv(). This change extends that logic to cover secure
connection. In particular, we pattern match against the exception
string "SSL_read: Resource temporarily unavailable" which corresponds
to EAGAIN error code being thrown in the SSL_read() path.

Similarly, we will handle closed connection in send() path with
secure connection by pattern matching against the exception string
"TTransportException: Transport not open". To verify that the exception
is thrown during the send part of a RPC call, the RPC client interface
has been augmented to take a bool* argument which is set to true after
the send part of the RPC has completed but before the recv part starts.
If DoRPC() catches an exception and the send part isn't done yet, the
entire RPC if the exception string matches certain substrings which are
safe to retry.

The fault injection utility has also been updated to distinguish between
time out and lost connection to exercise different error handling paths
in the send and recv paths.

Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
---
A be/src/catalog/catalog-service-client-wrapper.h
M be/src/exec/catalog-op-executor.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-util.cc
M be/src/runtime/backend-client.h
M be/src/runtime/client-cache-types.h
M be/src/runtime/client-cache.h
M be/src/service/client-request-state.cc
A be/src/statestore/statestore-service-client-wrapper.h
A be/src/statestore/statestore-subscriber-client-wrapper.h
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/testutil/fault-injection-util.cc
M be/src/testutil/fault-injection-util.h
M tests/custom_cluster/test_rpc_exception.py
17 files changed, 391 insertions(+), 87 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/29/7229/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................


Patch Set 7:

Turns off some clang-tidy warnings in the RPC client wrappers. We intentionally hide the auto-generated version of the client callstubs with this change.

-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................


Patch Set 2:

(6 comments)

looks good. I'll have another think about it. Could you ask Sailesh to take a look as well, please?

http://gerrit.cloudera.org:8080/#/c/7229/2/be/src/catalog/catalog-service-client-wrapper.h
File be/src/catalog/catalog-service-client-wrapper.h:

Line 25: class CatalogServiceClientWrapper: public CatalogServiceClient {
nit: space before :


http://gerrit.cloudera.org:8080/#/c/7229/2/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

PS2, Line 207: ||
Should this really be && ?


http://gerrit.cloudera.org:8080/#/c/7229/2/be/src/runtime/client-cache-types.h
File be/src/runtime/client-cache-types.h:

PS2, Line 23: CatalogServiceClientWrapper
replace line 38 with this one.


http://gerrit.cloudera.org:8080/#/c/7229/2/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS2, Line 243: bool* retry_is_safe = NULL
let's not do it now, but I think we should get rid of this parameter in favour of an indicator that it's ok to retry recv (and have this method call RetryRpcRecv() itself). Leave a comment, if you agree?


PS2, Line 319: bool send_done = false;
nit: move inside try block


http://gerrit.cloudera.org:8080/#/c/7229/2/be/src/statestore/statestore-service-client-wrapper.h
File be/src/statestore/statestore-service-client-wrapper.h:

PS2, Line 28:  
space really needed? here and below


-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................


Patch Set 6:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/766/

-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has uploaded a new patch set (#3).

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................

IMPALA-5537: Retry RPC on somes exceptions with SSL connection

After the fix for IMPALA-5388, all TSSLException thrown will be
treated as fatal error and the query will fail. Turns out that
this is too strict and in a secure cluster under load, queries
can easily hit timeout waiting for RPC response.

When running without SSL, we call RetryRpcRecv() to retry the recv
part of an RPC if the TSocket underlying the RPC gets an EAGAIN
during recv(). This change extends that logic to cover secure
connection. In particular, we pattern match against the exception
string "SSL_read: Resource temporarily unavailable" which corresponds
to EAGAIN error code being thrown in the SSL_read() path.

Similarly, we will handle closed connection in send() path with
secure connection by pattern matching against the exception string
"TTransportException: Transport not open". To verify that the exception
is thrown during the send part of a RPC call, the RPC client interface
has been augmented to take a bool* argument which is set to true after
the send part of the RPC has completed but before the recv part starts.
If DoRPC() catches an exception and the send part isn't done yet, the
entire RPC if the exception string matches certain substrings which are
safe to retry.

The fault injection utility has also been updated to distinguish between
time out and lost connection to exercise different error handling paths
in the send and recv paths.

Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
---
A be/src/catalog/catalog-service-client-wrapper.h
M be/src/exec/catalog-op-executor.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-util.cc
M be/src/runtime/backend-client.h
M be/src/runtime/client-cache-types.h
M be/src/runtime/client-cache.h
M be/src/service/client-request-state.cc
A be/src/statestore/statestore-service-client-wrapper.h
A be/src/statestore/statestore-subscriber-client-wrapper.h
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/testutil/fault-injection-util.cc
M be/src/testutil/fault-injection-util.h
M tests/custom_cluster/test_rpc_exception.py
17 files changed, 387 insertions(+), 85 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/29/7229/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7229/2/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

PS2, Line 208: // "TTransportException: Transport not open" is from TSSLSocket.cpp
> I don't think this comment is necessary. This could be from TSocket.cpp too
Actually, this specific exception message shows up only when a caller calls

throw TTransportException(TTransportException::NOT_OPEN);

I grep'ed the thrift library code and there are two files which have the pattern above: TSocketPool.cpp and TSSLSocket.cpp


http://gerrit.cloudera.org:8080/#/c/7229/2/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

Line 328:     return Status::OK();
> Not from this change, but I think we should set client_is_unrecoverable_=fa
Done


http://gerrit.cloudera.org:8080/#/c/7229/2/tests/custom_cluster/test_rpc_exception.py
File tests/custom_cluster/test_rpc_exception.py:

PS2, Line 46: execute_query
> Not from this change, but change to execute_test_query().
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5537: Retry RpcRecv on recv timeout exception for SSL connection

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-5537: Retry RpcRecv on recv timeout exception for SSL connection
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7229/1//COMMIT_MSG
Commit Message:

PS1, Line 24: After this change, we will still fail the RPC if there is
            : any error in the send part of a RPC with secure connection.
I'm still slightly worried about this. Our client cache could previously potentially have connections that were already broken for a variety of reasons.

Looking at the code now, it seems like it shouldn't be possible, because if an RPC fails, the client_is_unrecoverable_ is set to 'true' and it gets destroyed, but I'm not sure if there is some edge case where a broken connection would be put back into the cache, or worse, a connection gets broken while in the cache (due to a timeout or something similar).

If we get a broken connection from the cache, we would fail the RPC without even trying it, as our code would view that as a send() failure.

I'll look at it in some more detail to see if and how the above case could happen.


-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................


Patch Set 6: Code-Review+2

Carry +2.

-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has uploaded a new patch set (#4).

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................

IMPALA-5537: Retry RPC on somes exceptions with SSL connection

After the fix for IMPALA-5388, all TSSLException thrown will be
treated as fatal error and the query will fail. Turns out that
this is too strict and in a secure cluster under load, queries
can easily hit timeout waiting for RPC response.

When running without SSL, we call RetryRpcRecv() to retry the recv
part of an RPC if the TSocket underlying the RPC gets an EAGAIN
during recv(). This change extends that logic to cover secure
connection. In particular, we pattern match against the exception
string "SSL_read: Resource temporarily unavailable" which corresponds
to EAGAIN error code being thrown in the SSL_read() path.

Similarly, we will handle closed connection in send() path with
secure connection by pattern matching against the exception string
"TTransportException: Transport not open". To verify that the exception
is thrown during the send part of a RPC call, the RPC client interface
has been augmented to take a bool* argument which is set to true after
the send part of the RPC has completed but before the recv part starts.
If DoRPC() catches an exception and the send part isn't done yet, the
entire RPC if the exception string matches certain substrings which are
safe to retry.

The fault injection utility has also been updated to distinguish between
time out and lost connection to exercise different error handling paths
in the send and recv paths.

Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
---
A be/src/catalog/catalog-service-client-wrapper.h
M be/src/exec/catalog-op-executor.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-util.cc
M be/src/runtime/backend-client.h
M be/src/runtime/client-cache-types.h
M be/src/runtime/client-cache.h
M be/src/service/client-request-state.cc
A be/src/statestore/statestore-service-client-wrapper.h
A be/src/statestore/statestore-subscriber-client-wrapper.h
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/testutil/fault-injection-util.cc
M be/src/testutil/fault-injection-util.h
M tests/custom_cluster/test_rpc_exception.py
17 files changed, 387 insertions(+), 85 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/29/7229/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................


Patch Set 5: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7229/5/be/src/testutil/fault-injection-util.cc
File be/src/testutil/fault-injection-util.cc:

PS5, Line 78: 
Was this string wrong before, or did something change? If it was wrong, were the tests silently passing?


-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................


Patch Set 4:

(3 comments)

Thanks for doing this. It would be good to get a secure stress run with this patch too.

http://gerrit.cloudera.org:8080/#/c/7229/2/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

PS2, Line 208: // "TTransportException: Transport not open" is from TSSLSocket.cpp
I don't think this comment is necessary. This could be from TSocket.cpp too.


http://gerrit.cloudera.org:8080/#/c/7229/2/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

Line 328:       return Status(TErrorCode::RPC_GENERAL_ERROR, ExceptionMsg(e));
Not from this change, but I think we should set client_is_unrecoverable_=false before returning here. Else, we'll be destroying a "good" connection unnecessarily.


http://gerrit.cloudera.org:8080/#/c/7229/2/tests/custom_cluster/test_rpc_exception.py
File tests/custom_cluster/test_rpc_exception.py:

PS2, Line 46: execute_query
Not from this change, but change to execute_test_query().

execute_query() is another function implemented by the ImpalaTestSuite and we'd rather avoid confusion.


-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................


Patch Set 7: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7229/5/be/src/testutil/fault-injection-util.cc
File be/src/testutil/fault-injection-util.cc:

PS5, Line 78: 
> Was this string wrong before, or did something change? If it was wrong, wer
It didn't match the exception message from thrift but the query was expected to fail anyway as we never handled any kind of TSSLException before.


-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5537: Retry RpcRecv on recv timeout exception for SSL connection

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-5537: Retry RpcRecv on recv timeout exception for SSL connection
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7229/1/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

PS1, Line 209: Called write on non-open socket"
> It seems that we still didn't handle the case of send failure for failed co
I agree, that would be safer - then we can be certain what phase the exceptions get thrown in.


-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5537: Retry RpcRecv on recv timeout exception for SSL connection

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5537: Retry RpcRecv on recv timeout exception for SSL connection
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7229/1//COMMIT_MSG
Commit Message:

PS1, Line 24: After this change, we will still fail the RPC if there is
            : any error in the send part of a RPC with secure connection.
> I'm still slightly worried about this. Our client cache could previously po
Your concern is valid. I noticed the following in the log message of IMPALA-5537:

f27a50000001d #in-flight=27 status=RPC_GENERAL_ERROR: RPC Error: Client for <some-host>:22000 hits an unexpected exception: TTransportException: Transport not open, type: N6apache6thrift9transport19TTransportExceptionE

That's why I think the flag suggested in thrift-util.cc is needed as the exception thrown by TSSLSocket.cpp cannot help us tell between whether send has completed or not when the exception occurs.


-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................


Patch Set 2:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/7229/2/be/src/rpc/thrift-server-test.cc
File be/src/rpc/thrift-server-test.cc:

Line 104:     ssl_client.iface()->RegisterSubscriber(resp, TRegisterSubscriberRequest(), &send_done);
nit: long line.


Line 119:   ThriftClient<StatestoreServiceClientWrapper> ssl_client("localhost", port, "", NULL, true);
nit: long line.


Line 141:   ThriftClient<StatestoreServiceClientWrapper> ssl_client("localhost", port, "", NULL, true);
nit: long line


Line 169:   ThriftClient<StatestoreServiceClientWrapper> ssl_client("localhost", port, "", NULL, true);
nit: long line


Line 216:     ssl_client.iface()->RegisterSubscriber(resp, TRegisterSubscriberRequest(), &send_done);
nit: long line


Line 222:     ssl_client.iface()->RegisterSubscriber(resp, TRegisterSubscriberRequest(), &send_done);
nit: long line


http://gerrit.cloudera.org:8080/#/c/7229/2/be/src/rpc/thrift-util.cc
File be/src/rpc/thrift-util.cc:

PS2, Line 209: "TTransportException: Transport not open"
Note that this is what TTransportExcept::what() returns when the exception object was created with empty messages.


http://gerrit.cloudera.org:8080/#/c/7229/2/be/src/runtime/backend-client.h
File be/src/runtime/backend-client.h:

Line 113: 
nit: extra blank line.


-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5537: Retry RPC on somes exceptions with SSL connection

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Hello Henry Robinson, Sailesh Mukil, Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7229

to look at the new patch set (#7).

Change subject: IMPALA-5537: Retry RPC on somes exceptions with SSL connection
......................................................................

IMPALA-5537: Retry RPC on somes exceptions with SSL connection

After the fix for IMPALA-5388, all TSSLException thrown will be
treated as fatal error and the query will fail. Turns out that
this is too strict and in a secure cluster under load, queries
can easily hit timeout waiting for RPC response.

When running without SSL, we call RetryRpcRecv() to retry the recv
part of an RPC if the TSocket underlying the RPC gets an EAGAIN
during recv(). This change extends that logic to cover secure
connection. In particular, we pattern match against the exception
string "SSL_read: Resource temporarily unavailable" which corresponds
to EAGAIN error code being thrown in the SSL_read() path.

Similarly, we will handle closed connection in send() path with
secure connection by pattern matching against the exception string
"TTransportException: Transport not open". To verify that the exception
is thrown during the send part of a RPC call, the RPC client interface
has been augmented to take a bool* argument which is set to true after
the send part of the RPC has completed but before the recv part starts.
If DoRPC() catches an exception and the send part isn't done yet, the
entire RPC if the exception string matches certain substrings which are
safe to retry.

The fault injection utility has also been updated to distinguish between
time out and lost connection to exercise different error handling paths
in the send and recv paths.

Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
---
A be/src/catalog/catalog-service-client-wrapper.h
M be/src/exec/catalog-op-executor.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-util.cc
M be/src/runtime/backend-client.h
M be/src/runtime/client-cache-types.h
M be/src/runtime/client-cache.h
M be/src/service/client-request-state.cc
A be/src/statestore/statestore-service-client-wrapper.h
A be/src/statestore/statestore-subscriber-client-wrapper.h
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/testutil/fault-injection-util.cc
M be/src/testutil/fault-injection-util.h
M tests/custom_cluster/test_rpc_exception.py
17 files changed, 425 insertions(+), 92 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/29/7229/7
-- 
To view, visit http://gerrit.cloudera.org:8080/7229
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>