You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org> on 2022/04/21 22:39:06 UTC

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Wenzhe Zhou has uploaded this change for review. ( http://gerrit.cloudera.org:8080/18439


Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................

IMPALA-11263: Coordinator hang when cancelling a query

In a rare case, callback Coordinator::BackendState::ExecCompleteCb()
was not called for the corresponding ExecQueryFInstances RPC. This
caused coordinator to wait indefinitely when calling
Coordinator::BackendState::Cancel() to cancel one fragment instance.

This patch added timeout for BackendState::WaitOnExecLocked()
so that coordinator will not be blocked indefinitely when cancelling
a query.

Testing:
 - Added a test case to simulate the callback missing when a query
   is failed. Verified that the coordinator would hang without
   the fixing.
 - Passed core tests.

Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M tests/custom_cluster/test_rpc_timeout.py
3 files changed, 154 insertions(+), 96 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/18439/1
-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 1
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10510/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 Apr 2022 00:42:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................

IMPALA-11263: Coordinator hang when cancelling a query

In a rare case, callback Coordinator::BackendState::ExecCompleteCb()
is not called for the corresponding ExecQueryFInstances RPC when the
RPC is cancelled. This causes coordinator to wait indefinitely when
calling Coordinator::BackendState::Cancel() to cancel a fragment
instance.

This patch adds timeout for BackendState::WaitOnExecLocked() so that
coordinator will not be blocked indefinitely when cancelling a query.

Testing:
 - Added a test case to simulate the callback missing when a query
   is failed. Verified that the coordinator would hang without the
   fixing, and would not hang with the fixing.
 - Passed exhaustive-debug tests.

Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Reviewed-on: http://gerrit.cloudera.org:8080/18439
Reviewed-by: Joe McDonnell <jo...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M tests/custom_cluster/test_rpc_timeout.py
3 files changed, 155 insertions(+), 97 deletions(-)

Approvals:
  Joe McDonnell: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 6
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 4:

(3 comments)

Did code analysis for KRPC and found the root cause (see the comments in the IMPALA-11263). Will file a separate Jira to track KRPC issue.

http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc@204
PS3, Line 204: if (!
> That's a really good point. I missed that this while loop is now not really
fixed


http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc@622
PS3, Line 622: l();
> nit. "Exec() rpc was not responsive after waiting for $0 ms" probably descr
fixed


http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc@621
PS3, Line 621:       cancel_exec_rpc_ = true;
             :       exec_rpc_controller_.Cancel();
             :       if (!WaitOnExecLocked(&l, (int64_t)FLAGS_backend_client_rpc_timeout_ms)) {
             :         VLogForBackend(Substitute(
             :             "Exec() rpc was not 
> Just thinking this through: if this times out, then we will fall through an
Right, we are not sure the state of backend if it's timeout. It's better to send cancellation RPC to backend.



-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 4
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Apr 2022 23:07:21 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10477/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 1
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Apr 2022 22:58:58 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has uploaded a new patch set (#2). ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................

IMPALA-11263: Coordinator hang when cancelling a query

In a rare case, callback Coordinator::BackendState::ExecCompleteCb()
was not called for the corresponding ExecQueryFInstances RPC. This
caused coordinator to wait indefinitely when calling
Coordinator::BackendState::Cancel() to cancel one fragment instance.

This patch added timeout for BackendState::WaitOnExecLocked()
so that coordinator will not be blocked indefinitely when cancelling
a query.

Testing:
 - Added a test case to simulate the callback missing when a query
   is failed. Verified that the coordinator would hang without
   the fixing.
 - Passed core tests.

Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M tests/custom_cluster/test_rpc_timeout.py
3 files changed, 154 insertions(+), 96 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/18439/2
-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10501/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 4
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Apr 2022 23:22:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10489/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Apr 2022 16:49:17 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................

IMPALA-11263: Coordinator hang when cancelling a query

In a rare case, callback Coordinator::BackendState::ExecCompleteCb()
is not called for the corresponding ExecQueryFInstances RPC when the
RPC is cancelled. This causes coordinator to wait indefinitely when
calling Coordinator::BackendState::Cancel() to cancel a fragment
instance.

This patch adds timeout for BackendState::WaitOnExecLocked() so that
coordinator will not be blocked indefinitely when cancelling a query.

Testing:
 - Added a test case to simulate the callback missing when a query
   is failed. Verified that the coordinator would hang without the
   fixing, and would not hang with the fixing.
 - Passed exhaustive-debug tests.

Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M tests/custom_cluster/test_rpc_timeout.py
3 files changed, 157 insertions(+), 97 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/18439/4
-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 4
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 5: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Sat, 30 Apr 2022 03:23:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18439/1/tests/custom_cluster/test_rpc_timeout.py
File tests/custom_cluster/test_rpc_timeout.py:

http://gerrit.cloudera.org:8080/#/c/18439/1/tests/custom_cluster/test_rpc_timeout.py@211
PS1, Line 211: _
> flake8: E131 continuation line unaligned for hanging indent
fixed



-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Apr 2022 23:11:02 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc@204
PS3, Line 204: while
> nit. Use of 'IF' here probably is better for understanding the code.
That's a really good point. I missed that this while loop is now not really a while loop.

We could also add a DCHECK after the exec_done_cv_.Wait(*l) line.



-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Apr 2022 16:28:28 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8071/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 Apr 2022 18:26:09 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18439/4/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/18439/4/be/src/runtime/coordinator-backend-state.cc@204
PS4, Line 204:   if (!exec_done_) {
             :     if (timeout_ms <= 0) {
             :       exec_done_cv_.Wait(*l);
             :       DCHECK(exec_done_);
             :       return true;
             :     } else {
             :       bool notified = exec_done_cv_.WaitFor(*l, timeout_ms * MICROS_PER_MILLI);
             :       if (notified) DCHECK(exec_done_);
             :       return notified;
             :     }
             :   }
> After reading our existing uses of condition variables, I'm changing my min
Thanks for your detail comments. Will fix it as suggested



-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 4
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 Apr 2022 00:17:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 5: Code-Review+2

Thanks, this looks good to me.


-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 Apr 2022 18:15:24 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 3:

(1 comment)

This is looking good. I just had one comment to check my understanding.

http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc@621
PS3, Line 621:         VLogForBackend(Substitute(
             :             "Exec() rpc was not completed or cancelled after waiting for $0 ms",
             :             FLAGS_backend_client_rpc_timeout_ms));
             :         exec_done_ = true;
             :         notify_exec_done = true;
Just thinking this through: if this times out, then we will fall through and send a cancel RPC at line 679. That seems reasonable, because we can't be sure the RPC got cancelled.



-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Tue, 26 Apr 2022 23:04:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8072/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 Apr 2022 23:02:50 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10478/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Apr 2022 23:08:13 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18439/4/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/18439/4/be/src/runtime/coordinator-backend-state.cc@204
PS4, Line 204:   if (!exec_done_) {
             :     if (timeout_ms <= 0) {
             :       exec_done_cv_.Wait(*l);
             :       DCHECK(exec_done_);
             :       return true;
             :     } else {
             :       bool notified = exec_done_cv_.WaitFor(*l, timeout_ms * MICROS_PER_MILLI);
             :       if (notified) DCHECK(exec_done_);
             :       return notified;
             :     }
             :   }
After reading our existing uses of condition variables, I'm changing my mind. It seems like it is safer if we can handle spurious wakeups by using a while loop. All our other code calls Wait() in a loop. Sorry I missed that before, we should probably do something similar to this:

bool timed_out = false;
while (!exec_done_ && !timed_out) {
  if (timeout_ms <= 0) {
    exec_done_cv_.Wait(*l);
  } else {
    // WaitFor returns true if notified
    timed_out = !exec_done_cv_.WaitFor(*l, timeout_ms * MICROS_PER_MILLI);
  }
}
return !timed_out;

We may want to track the time for the second case and sleep the remainder.



-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 4
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 Apr 2022 22:01:41 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................

IMPALA-11263: Coordinator hang when cancelling a query

In a rare case, callback Coordinator::BackendState::ExecCompleteCb()
is not called for the corresponding ExecQueryFInstances RPC when the
RPC is cancelled. This causes coordinator to wait indefinitely when
calling Coordinator::BackendState::Cancel() to cancel a fragment
instance.

This patch adds timeout for BackendState::WaitOnExecLocked() so that
coordinator will not be blocked indefinitely when cancelling a query.

Testing:
 - Added a test case to simulate the callback missing when a query
   is failed. Verified that the coordinator would hang without the
   fixing, and would not hang with the fixing.
 - Passed exhaustive-debug tests.

Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M tests/custom_cluster/test_rpc_timeout.py
3 files changed, 155 insertions(+), 97 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/18439/5
-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 3:

(2 comments)

Looks good to me.

http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc@204
PS3, Line 204: while
nit. Use of 'IF' here probably is better for understanding the code.


http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc@622
PS3, Line 622: completed or cancelled
nit. "Exec() rpc was not responsive after waiting for $0 ms" probably describes the state more precisely.



-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Apr 2022 15:01:54 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................

IMPALA-11263: Coordinator hang when cancelling a query

In a rare case, callback Coordinator::BackendState::ExecCompleteCb()
is not called for the corresponding ExecQueryFInstances RPC when the
RPC is cancelled. This causes coordinator to wait indefinitely when
calling Coordinator::BackendState::Cancel() to cancel a fragment
instance.

This patch adds timeout for BackendState::WaitOnExecLocked() so that
coordinator will not be blocked indefinitely when cancelling a query.

Testing:
 - Added a test case to simulate the callback missing when a query
   is failed. Verified that the coordinator would hang without the
   fixing, and would not hang with the fixing.
 - Passed exhaustive-debug tests.

Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M tests/custom_cluster/test_rpc_timeout.py
3 files changed, 153 insertions(+), 96 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/18439/3
-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 5: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/8071/


-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 Apr 2022 22:51:52 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 5:

Hit flaky test IMPALA-10927 and IMPALA-11276


-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 Apr 2022 23:01:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )

Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18439/1/tests/custom_cluster/test_rpc_timeout.py
File tests/custom_cluster/test_rpc_timeout.py:

http://gerrit.cloudera.org:8080/#/c/18439/1/tests/custom_cluster/test_rpc_timeout.py@211
PS1, Line 211: _
flake8: E131 continuation line unaligned for hanging indent



-- 
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 1
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Apr 2022 22:39:56 +0000
Gerrit-HasComments: Yes