You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org> on 2020/12/10 00:12:13 UTC

[Impala-ASF-CR] IMPALA-10336: Coordinator return incorrect error to client

Wenzhe Zhou has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16849


Change subject: IMPALA-10336: Coordinator return incorrect error to client
......................................................................

IMPALA-10336: Coordinator return incorrect error to client

Due to race condition, coordinator could set execution status as RPC
aborted due to cancellation. This internal error should not be
returned to client.
Instead of separate function call to get backend status, add more
parameters in function BackendState::ApplyExecStatusReport() to return
backend status.

Testing:
 - Could not re-produce the issue by running the test case
   test_scanners.py::TestOrc::test_type_conversions_hive3 in a loop.
 - Passed exhausive test.

Change-Id: I75f252e43006c6ff6980800e3254672de396b318
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
3 files changed, 21 insertions(+), 8 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/16849/1
-- 
To view, visit http://gerrit.cloudera.org:8080/16849
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I75f252e43006c6ff6980800e3254672de396b318
Gerrit-Change-Number: 16849
Gerrit-PatchSet: 1
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>

[Impala-ASF-CR] IMPALA-10336: Coordinator return incorrect error to client

Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has uploaded a new patch set (#2). ( http://gerrit.cloudera.org:8080/16849 )

Change subject: IMPALA-10336: Coordinator return incorrect error to client
......................................................................

IMPALA-10336: Coordinator return incorrect error to client

Due to race condition, coordinator could set execution status as RPC
aborted due to cancellation. This internal error should not be
returned to client.
Instead of separate function call to get backend status, add more
parameters in function BackendState::ApplyExecStatusReport() to return
backend status.

Testing:
 - Could not re-produce the issue by running the test case
   test_scanners.py::TestOrc::test_type_conversions_hive3 in a loop.
 - Passed exhausive test.

Change-Id: I75f252e43006c6ff6980800e3254672de396b318
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
3 files changed, 23 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/16849/2
-- 
To view, visit http://gerrit.cloudera.org:8080/16849
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I75f252e43006c6ff6980800e3254672de396b318
Gerrit-Change-Number: 16849
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-10336: Coordinator return incorrect error to client

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16849 )

Change subject: IMPALA-10336: Coordinator return incorrect error to client
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/7875/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16849
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I75f252e43006c6ff6980800e3254672de396b318
Gerrit-Change-Number: 16849
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Thu, 17 Dec 2020 03:14:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10336: Coordinator return incorrect error to client

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16849 )

Change subject: IMPALA-10336: Coordinator return incorrect error to client
......................................................................


Patch Set 4: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/16849
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I75f252e43006c6ff6980800e3254672de396b318
Gerrit-Change-Number: 16849
Gerrit-PatchSet: 4
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Mon, 21 Dec 2020 19:25:23 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10336: Coordinator return incorrect error to client

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/16849 )

Change subject: IMPALA-10336: Coordinator return incorrect error to client
......................................................................


Patch Set 3: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/16849
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I75f252e43006c6ff6980800e3254672de396b318
Gerrit-Change-Number: 16849
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Mon, 21 Dec 2020 19:25:09 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10336: Coordinator return incorrect error to client

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/16849 )

Change subject: IMPALA-10336: Coordinator return incorrect error to client
......................................................................


Patch Set 2:

So I think the underlying issue here is that the ExecCompleteCb() can be called with an ABORTED status even though the rpc was actually sent successfully. The KRPC docs are a little vague about this, but see https://github.com/apache/impala/blob/master/be/src/kudu/rpc/outbound_call.cc#L246

This means we can get an "OK" status report for a backend that had its status previously set as ABORTED. Since that backend will be 'done' (because it has an error status), we'll try to set the overall query status to ABORTED.

Since the cancellation that happened that caused the rpc to get ABORTED happened because of another backend reporting an error, most of the time the error from that backend will end up set as the overall query status, but there's a race where the thread processing the report for the ABORTED backend could win and end up setting the overall query status to ABORTED first.

I don't think this patch fixes that - its possible that the backend's status will be set to ABORTED prior to the call to UpdateBackendExecStatus() for it, in which case I think that ApplyExecStatusReport() will still just return ABORTED as the error in the status out parameter that you added.

I think you could repro the situation if you add some artificial delays in two places: 1) have the backend that does not encounter an error sleep indefinitely before responding to the Exec() rpc but after starting execution (so that KRPC will say that it was ABORTED but it will still try to send status reports) and 2) have Coordinator::UpdateBackendExecStatus sleep indefinitely when processing the status report from the backend that reports the "real" error after it causes the other Exec rpcs to be cancelled (i.e. after the call to exec_rpcs_status_barrier_.NotifyRemaining)

I think instead the solution is to ensure that if a backend has Cancel() called on it, that its status will end up as CANCELLED, and not ABORTED. This fixes the problem in the JIRA because Coordinator::UpdateExecState already has logic where it will prefer to set the overall query status to a "real" status over CANCELLED, see https://github.com/apache/impala/blob/master/be/src/runtime/coordinator.cc#L704, and its a nice invariant that a cancelled backend will have a status of Cancelled()

Of course, let me know if anything above is wrong.


-- 
To view, visit http://gerrit.cloudera.org:8080/16849
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I75f252e43006c6ff6980800e3254672de396b318
Gerrit-Change-Number: 16849
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 15 Dec 2020 21:19:37 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10336: Coordinator return incorrect error to client

Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has posted comments on this change. ( http://gerrit.cloudera.org:8080/16849 )

Change subject: IMPALA-10336: Coordinator return incorrect error to client
......................................................................


Patch Set 2:

Right. I could reproduce the situation by adding some artificial delays in 3 places: QueryExecMgr.StartQuery(), Coordinator.UpdateBackendExecStatus(), and Coordinator::StartBackendExec() when running test case
test_scanners.py::TestOrc::test_type_conversions_hive3.


-- 
To view, visit http://gerrit.cloudera.org:8080/16849
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I75f252e43006c6ff6980800e3254672de396b318
Gerrit-Change-Number: 16849
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Thu, 17 Dec 2020 02:50:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10336: Coordinator return incorrect error to client

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16849 )

Change subject: IMPALA-10336: Coordinator return incorrect error to client
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/7835/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16849
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I75f252e43006c6ff6980800e3254672de396b318
Gerrit-Change-Number: 16849
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 10 Dec 2020 23:22:17 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10336: Coordinator return incorrect error to client

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16849 )

Change subject: IMPALA-10336: Coordinator return incorrect error to client
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/7815/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16849
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I75f252e43006c6ff6980800e3254672de396b318
Gerrit-Change-Number: 16849
Gerrit-PatchSet: 1
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 10 Dec 2020 00:34:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10336: Coordinator return incorrect error to client

Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/16849 )

Change subject: IMPALA-10336: Coordinator return incorrect error to client
......................................................................

IMPALA-10336: Coordinator return incorrect error to client

Due to race condition, coordinator could set execution status as RPC
aborted due to cancellation. This internal error should not be
returned to client.
This patch fixed the issue by setting the backend status as CANCELLED
instead of ABORTED if the exec RPC was aborted due to cancellation.

Testing:
 - Manual tests
   Since this is a racy bug, I could only reproduce the situation by
   adding some artificial delays in 3 places: QueryExecMgr.StartQuery(),
   Coordinator.UpdateBackendExecStatus(), and
   Coordinator::StartBackendExec() when running test case
   test_scanners.py::TestOrc::test_type_conversions_hive3.
   Verified that the issue did not happen after applying this patch
   by running test_scanners.py::TestOrc::test_type_conversions_hive3
   in a loop for hours.
 - Passed exhausive test.

Change-Id: I75f252e43006c6ff6980800e3254672de396b318
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
2 files changed, 15 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/16849/3
-- 
To view, visit http://gerrit.cloudera.org:8080/16849
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I75f252e43006c6ff6980800e3254672de396b318
Gerrit-Change-Number: 16849
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>

[Impala-ASF-CR] IMPALA-10336: Coordinator return incorrect error to client

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16849 )

Change subject: IMPALA-10336: Coordinator return incorrect error to client
......................................................................

IMPALA-10336: Coordinator return incorrect error to client

Due to race condition, coordinator could set execution status as RPC
aborted due to cancellation. This internal error should not be
returned to client.
This patch fixed the issue by setting the backend status as CANCELLED
instead of ABORTED if the exec RPC was aborted due to cancellation.

Testing:
 - Manual tests
   Since this is a racy bug, I could only reproduce the situation by
   adding some artificial delays in 3 places: QueryExecMgr.StartQuery(),
   Coordinator.UpdateBackendExecStatus(), and
   Coordinator::StartBackendExec() when running test case
   test_scanners.py::TestOrc::test_type_conversions_hive3.
   Verified that the issue did not happen after applying this patch
   by running test_scanners.py::TestOrc::test_type_conversions_hive3
   in a loop for hours.
 - Passed exhausive test.

Change-Id: I75f252e43006c6ff6980800e3254672de396b318
Reviewed-on: http://gerrit.cloudera.org:8080/16849
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
2 files changed, 15 insertions(+), 2 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/16849
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I75f252e43006c6ff6980800e3254672de396b318
Gerrit-Change-Number: 16849
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>

[Impala-ASF-CR] IMPALA-10336: Coordinator return incorrect error to client

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16849 )

Change subject: IMPALA-10336: Coordinator return incorrect error to client
......................................................................


Patch Set 4: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/16849
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I75f252e43006c6ff6980800e3254672de396b318
Gerrit-Change-Number: 16849
Gerrit-PatchSet: 4
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Tue, 22 Dec 2020 01:02:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10336: Coordinator return incorrect error to client

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16849 )

Change subject: IMPALA-10336: Coordinator return incorrect error to client
......................................................................


Patch Set 4:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6797/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/16849
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I75f252e43006c6ff6980800e3254672de396b318
Gerrit-Change-Number: 16849
Gerrit-PatchSet: 4
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Mon, 21 Dec 2020 19:25:24 +0000
Gerrit-HasComments: No