You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org> on 2022/04/21 22:39:06 UTC
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Wenzhe Zhou has uploaded this change for review. ( http://gerrit.cloudera.org:8080/18439
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
IMPALA-11263: Coordinator hang when cancelling a query
In a rare case, callback Coordinator::BackendState::ExecCompleteCb()
was not called for the corresponding ExecQueryFInstances RPC. This
caused coordinator to wait indefinitely when calling
Coordinator::BackendState::Cancel() to cancel one fragment instance.
This patch added timeout for BackendState::WaitOnExecLocked()
so that coordinator will not be blocked indefinitely when cancelling
a query.
Testing:
- Added a test case to simulate the callback missing when a query
is failed. Verified that the coordinator would hang without
the fixing.
- Passed core tests.
Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M tests/custom_cluster/test_rpc_timeout.py
3 files changed, 154 insertions(+), 96 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/18439/1
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 1
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 5:
Build Successful
https://jenkins.impala.io/job/gerrit-code-review-checks/10510/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 Apr 2022 00:42:07 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
IMPALA-11263: Coordinator hang when cancelling a query
In a rare case, callback Coordinator::BackendState::ExecCompleteCb()
is not called for the corresponding ExecQueryFInstances RPC when the
RPC is cancelled. This causes coordinator to wait indefinitely when
calling Coordinator::BackendState::Cancel() to cancel a fragment
instance.
This patch adds timeout for BackendState::WaitOnExecLocked() so that
coordinator will not be blocked indefinitely when cancelling a query.
Testing:
- Added a test case to simulate the callback missing when a query
is failed. Verified that the coordinator would hang without the
fixing, and would not hang with the fixing.
- Passed exhaustive-debug tests.
Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Reviewed-on: http://gerrit.cloudera.org:8080/18439
Reviewed-by: Joe McDonnell <jo...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M tests/custom_cluster/test_rpc_timeout.py
3 files changed, 155 insertions(+), 97 deletions(-)
Approvals:
Joe McDonnell: Looks good to me, approved
Impala Public Jenkins: Verified
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 6
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 4:
(3 comments)
Did code analysis for KRPC and found the root cause (see the comments in the IMPALA-11263). Will file a separate Jira to track KRPC issue.
http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:
http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc@204
PS3, Line 204: if (!
> That's a really good point. I missed that this while loop is now not really
fixed
http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc@622
PS3, Line 622: l();
> nit. "Exec() rpc was not responsive after waiting for $0 ms" probably descr
fixed
http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc@621
PS3, Line 621: cancel_exec_rpc_ = true;
: exec_rpc_controller_.Cancel();
: if (!WaitOnExecLocked(&l, (int64_t)FLAGS_backend_client_rpc_timeout_ms)) {
: VLogForBackend(Substitute(
: "Exec() rpc was not
> Just thinking this through: if this times out, then we will fall through an
Right, we are not sure the state of backend if it's timeout. It's better to send cancellation RPC to backend.
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 4
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Apr 2022 23:07:21 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 1:
Build Successful
https://jenkins.impala.io/job/gerrit-code-review-checks/10477/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 1
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Apr 2022 22:58:58 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has uploaded a new patch set (#2). ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
IMPALA-11263: Coordinator hang when cancelling a query
In a rare case, callback Coordinator::BackendState::ExecCompleteCb()
was not called for the corresponding ExecQueryFInstances RPC. This
caused coordinator to wait indefinitely when calling
Coordinator::BackendState::Cancel() to cancel one fragment instance.
This patch added timeout for BackendState::WaitOnExecLocked()
so that coordinator will not be blocked indefinitely when cancelling
a query.
Testing:
- Added a test case to simulate the callback missing when a query
is failed. Verified that the coordinator would hang without
the fixing.
- Passed core tests.
Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M tests/custom_cluster/test_rpc_timeout.py
3 files changed, 154 insertions(+), 96 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/18439/2
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 4:
Build Successful
https://jenkins.impala.io/job/gerrit-code-review-checks/10501/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 4
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Apr 2022 23:22:55 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 3:
Build Successful
https://jenkins.impala.io/job/gerrit-code-review-checks/10489/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Apr 2022 16:49:17 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
IMPALA-11263: Coordinator hang when cancelling a query
In a rare case, callback Coordinator::BackendState::ExecCompleteCb()
is not called for the corresponding ExecQueryFInstances RPC when the
RPC is cancelled. This causes coordinator to wait indefinitely when
calling Coordinator::BackendState::Cancel() to cancel a fragment
instance.
This patch adds timeout for BackendState::WaitOnExecLocked() so that
coordinator will not be blocked indefinitely when cancelling a query.
Testing:
- Added a test case to simulate the callback missing when a query
is failed. Verified that the coordinator would hang without the
fixing, and would not hang with the fixing.
- Passed exhaustive-debug tests.
Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M tests/custom_cluster/test_rpc_timeout.py
3 files changed, 157 insertions(+), 97 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/18439/4
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 4
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 5: Verified+1
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Sat, 30 Apr 2022 03:23:10 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 2:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/18439/1/tests/custom_cluster/test_rpc_timeout.py
File tests/custom_cluster/test_rpc_timeout.py:
http://gerrit.cloudera.org:8080/#/c/18439/1/tests/custom_cluster/test_rpc_timeout.py@211
PS1, Line 211: _
> flake8: E131 continuation line unaligned for hanging indent
fixed
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Apr 2022 23:11:02 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 3:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:
http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc@204
PS3, Line 204: while
> nit. Use of 'IF' here probably is better for understanding the code.
That's a really good point. I missed that this while loop is now not really a while loop.
We could also add a DCHECK after the exec_done_cv_.Wait(*l) line.
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Apr 2022 16:28:28 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 5:
Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8071/ DRY_RUN=false
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 Apr 2022 18:26:09 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 4:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/18439/4/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:
http://gerrit.cloudera.org:8080/#/c/18439/4/be/src/runtime/coordinator-backend-state.cc@204
PS4, Line 204: if (!exec_done_) {
: if (timeout_ms <= 0) {
: exec_done_cv_.Wait(*l);
: DCHECK(exec_done_);
: return true;
: } else {
: bool notified = exec_done_cv_.WaitFor(*l, timeout_ms * MICROS_PER_MILLI);
: if (notified) DCHECK(exec_done_);
: return notified;
: }
: }
> After reading our existing uses of condition variables, I'm changing my min
Thanks for your detail comments. Will fix it as suggested
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 4
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 Apr 2022 00:17:17 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 5: Code-Review+2
Thanks, this looks good to me.
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 Apr 2022 18:15:24 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 3:
(1 comment)
This is looking good. I just had one comment to check my understanding.
http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:
http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc@621
PS3, Line 621: VLogForBackend(Substitute(
: "Exec() rpc was not completed or cancelled after waiting for $0 ms",
: FLAGS_backend_client_rpc_timeout_ms));
: exec_done_ = true;
: notify_exec_done = true;
Just thinking this through: if this times out, then we will fall through and send a cancel RPC at line 679. That seems reasonable, because we can't be sure the RPC got cancelled.
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Tue, 26 Apr 2022 23:04:27 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 5:
Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8072/ DRY_RUN=false
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 Apr 2022 23:02:50 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 2:
Build Successful
https://jenkins.impala.io/job/gerrit-code-review-checks/10478/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Apr 2022 23:08:13 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 4:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/18439/4/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:
http://gerrit.cloudera.org:8080/#/c/18439/4/be/src/runtime/coordinator-backend-state.cc@204
PS4, Line 204: if (!exec_done_) {
: if (timeout_ms <= 0) {
: exec_done_cv_.Wait(*l);
: DCHECK(exec_done_);
: return true;
: } else {
: bool notified = exec_done_cv_.WaitFor(*l, timeout_ms * MICROS_PER_MILLI);
: if (notified) DCHECK(exec_done_);
: return notified;
: }
: }
After reading our existing uses of condition variables, I'm changing my mind. It seems like it is safer if we can handle spurious wakeups by using a while loop. All our other code calls Wait() in a loop. Sorry I missed that before, we should probably do something similar to this:
bool timed_out = false;
while (!exec_done_ && !timed_out) {
if (timeout_ms <= 0) {
exec_done_cv_.Wait(*l);
} else {
// WaitFor returns true if notified
timed_out = !exec_done_cv_.WaitFor(*l, timeout_ms * MICROS_PER_MILLI);
}
}
return !timed_out;
We may want to track the time for the second case and sleep the remainder.
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 4
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 Apr 2022 22:01:41 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
IMPALA-11263: Coordinator hang when cancelling a query
In a rare case, callback Coordinator::BackendState::ExecCompleteCb()
is not called for the corresponding ExecQueryFInstances RPC when the
RPC is cancelled. This causes coordinator to wait indefinitely when
calling Coordinator::BackendState::Cancel() to cancel a fragment
instance.
This patch adds timeout for BackendState::WaitOnExecLocked() so that
coordinator will not be blocked indefinitely when cancelling a query.
Testing:
- Added a test case to simulate the callback missing when a query
is failed. Verified that the coordinator would hang without the
fixing, and would not hang with the fixing.
- Passed exhaustive-debug tests.
Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M tests/custom_cluster/test_rpc_timeout.py
3 files changed, 155 insertions(+), 97 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/18439/5
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 3:
(2 comments)
Looks good to me.
http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:
http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc@204
PS3, Line 204: while
nit. Use of 'IF' here probably is better for understanding the code.
http://gerrit.cloudera.org:8080/#/c/18439/3/be/src/runtime/coordinator-backend-state.cc@622
PS3, Line 622: completed or cancelled
nit. "Exec() rpc was not responsive after waiting for $0 ms" probably describes the state more precisely.
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Apr 2022 15:01:54 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
IMPALA-11263: Coordinator hang when cancelling a query
In a rare case, callback Coordinator::BackendState::ExecCompleteCb()
is not called for the corresponding ExecQueryFInstances RPC when the
RPC is cancelled. This causes coordinator to wait indefinitely when
calling Coordinator::BackendState::Cancel() to cancel a fragment
instance.
This patch adds timeout for BackendState::WaitOnExecLocked() so that
coordinator will not be blocked indefinitely when cancelling a query.
Testing:
- Added a test case to simulate the callback missing when a query
is failed. Verified that the coordinator would hang without the
fixing, and would not hang with the fixing.
- Passed exhaustive-debug tests.
Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M tests/custom_cluster/test_rpc_timeout.py
3 files changed, 153 insertions(+), 96 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/18439/3
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 5: Verified-1
Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/8071/
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 Apr 2022 22:51:52 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 5:
Hit flaky test IMPALA-10927 and IMPALA-11276
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 Apr 2022 23:01:48 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11263: Coordinator hang when cancelling a query
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18439 )
Change subject: IMPALA-11263: Coordinator hang when cancelling a query
......................................................................
Patch Set 1:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/18439/1/tests/custom_cluster/test_rpc_timeout.py
File tests/custom_cluster/test_rpc_timeout.py:
http://gerrit.cloudera.org:8080/#/c/18439/1/tests/custom_cluster/test_rpc_timeout.py@211
PS1, Line 211: _
flake8: E131 continuation line unaligned for hanging indent
--
To view, visit http://gerrit.cloudera.org:8080/18439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I915511afe2df3017cbbf37f6aff3c5ff7f5473be
Gerrit-Change-Number: 18439
Gerrit-PatchSet: 1
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Apr 2022 22:39:56 +0000
Gerrit-HasComments: Yes