You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org> on 2017/08/03 19:13:44 UTC

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Thomas Tauber-Marshall has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7577

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................

IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'

In Coordinator::UpdateBackendExecStatus(), we check if the backend
has already completed with BackendState::IsDone() and return without
applying the update if so to avoid updating num_remaining_backends_
twice for the same completed backend.

The problem is that the value of BackendState::IsDone() is updated by
the call to BackendState::ApplyExecStatusReport() that comes after it,
but these operations are not performed atomically, so if there are
two simultaneous calls to UpdateBackendExecStatus(), they can both
call IsDone(), both get 'false', and then proceed to erroneously both
update num_remaining_backends_, hitting a DCHECK.

The solution is to perform both the call to IsDone() and the update to
it atomically by holding the BackendState::lock_.

Testing:
- Ran test_finst_cancel_when_query_complete 10,000 times without
  hitting the DCHECK (previously, it would hit about once per 300
  runs).

Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
3 files changed, 16 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/7577/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


Patch Set 3: Code-Review+1

(3 comments)

Thanks - this seems a lot better to me. Did you re-run test_finst_cancel?

http://gerrit.cloudera.org:8080/#/c/7577/3/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

PS3, Line 971: VLOG_QUERY << "Backend completed: "
             :         << " host=" << backend_state->impalad_address()
             :         << " remaining=" << num_remaining_backends_ - 1;
             :     if (VLOG_QUERY_IS_ON && num_remaining_backends_ > 1) {
not your change, but while we're here: can you hoist the VLOG_QUERY_IS_ON check to include line 971?


Line 987:     }
might be clearer just to return Status::OK() here, and avoid the need for an 'else'.


PS3, Line 991: !backend_state->IsDone()
let's remove this. I can see the race where IsDone() becomes true after line 953 if there are two concurrent reports, but I would expect it to be harmless. We also should really fix the possibility of concurrent reports properly - can you file a JIRA?


-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Hello Henry Robinson, Michael Ho, Sailesh Mukil,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7577

to look at the new patch set (#5).

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................

IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'

In Coordinator::UpdateBackendExecStatus(), we check if the backend
has already completed with BackendState::IsDone() and return without
applying the update if so to avoid updating num_remaining_backends_
twice for the same completed backend.

The problem is that the value of BackendState::IsDone() is updated by
the call to BackendState::ApplyExecStatusReport() that comes after it,
but these operations are not performed atomically, so if there are
two simultaneous calls to UpdateBackendExecStatus(), they can both
call IsDone(), both get 'false', and then proceed to erroneously both
update num_remaining_backends_, hitting a DCHECK.

This patch modifies ApplyExecStatusReport to return true iff this
report transitioned the backend to a done status, and then only
updates num_remaining_backends_ in this case, ensuring it is only
updated once per backend.

Testing:
- Ran test_finst_cancel_when_query_complete 10,000 times without
  hitting the DCHECK (previously, it would hit about once per 300
  runs).

Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
3 files changed, 52 insertions(+), 52 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/7577/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7577/1/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

PS1, Line 239:   // If this backend completed previously, don't apply the update.
             :   if (IsDoneInternal()) return false;
Sorry, I don't quite understand the race here. Why won't line 250 below prevent the same fragment instance from decrementing num_remaining_instances_ twice ?


-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1123/

-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


Patch Set 5: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7577/1/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

PS1, Line 239:   // If this backend completed previously, don't apply the update.
             :   if (IsDoneInternal()) return false;
> Sorry, I don't quite understand the race here. Why won't line 250 below pre
Never mind. I mis-read the commit message. The DHCECK triggered is not line 267 below. Sorry for the noise.


-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'

In Coordinator::UpdateBackendExecStatus(), we check if the backend
has already completed with BackendState::IsDone() and return without
applying the update if so to avoid updating num_remaining_backends_
twice for the same completed backend.

The problem is that the value of BackendState::IsDone() is updated by
the call to BackendState::ApplyExecStatusReport() that comes after it,
but these operations are not performed atomically, so if there are
two simultaneous calls to UpdateBackendExecStatus(), they can both
call IsDone(), both get 'false', and then proceed to erroneously both
update num_remaining_backends_, hitting a DCHECK.

This patch modifies ApplyExecStatusReport to return true iff this
report transitioned the backend to a done status, and then only
updates num_remaining_backends_ in this case, ensuring it is only
updated once per backend.

Testing:
- Ran test_finst_cancel_when_query_complete 10,000 times without
  hitting the DCHECK (previously, it would hit about once per 300
  runs).

Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Reviewed-on: http://gerrit.cloudera.org:8080/7577
Reviewed-by: Dan Hecht <dh...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
3 files changed, 52 insertions(+), 52 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Dan Hecht: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7577/4/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

Line 113:   /// debugging aid for backend deadlocks.
> does it still make sense to expose that?  as this bug shows, seems better t
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Sailesh Mukil,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7577

to look at the new patch set (#2).

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................

IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'

In Coordinator::UpdateBackendExecStatus(), we check if the backend
has already completed with BackendState::IsDone() and return without
applying the update if so to avoid updating num_remaining_backends_
twice for the same completed backend.

The problem is that the value of BackendState::IsDone() is updated by
the call to BackendState::ApplyExecStatusReport() that comes after it,
but these operations are not performed atomically, so if there are
two simultaneous calls to UpdateBackendExecStatus(), they can both
call IsDone(), both get 'false', and then proceed to erroneously both
update num_remaining_backends_, hitting a DCHECK.

The solution is to perform both the call to IsDone() and the update to
it atomically by holding the BackendState::lock_.

Testing:
- Ran test_finst_cancel_when_query_complete 10,000 times without
  hitting the DCHECK (previously, it would hit about once per 300
  runs).

Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
3 files changed, 18 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/7577/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


Patch Set 4: Code-Review+1

Carrying forward +1

-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


Patch Set 1: Code-Review+1

(1 comment)

Not sure if Henry has more comments ?

http://gerrit.cloudera.org:8080/#/c/7577/1/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

Line 75:   /// this update was not applied because this backend has previously completed.
Please also add:

This can happen because the final update for a fragment instance may arrive concurrently with previous updates and they can be processed out of order.


-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


Patch Set 1: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7577/1//COMMIT_MSG
Commit Message:

PS1, Line 16: so if there are
            : two simultaneous calls to UpdateBackendExecStatus()
nit rephrase:
so if there are two reports for the last fragment instance in a backend that arrive at the coordinator at the same time, they can call UpdateBackendExecStatus() simultaneously, both call IsDone(), ...


-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7577/2/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

PS2, Line 947: bool done;
             :   if (!backend_state->ApplyExecStatusReport(params, &exec_summary_, &progress_, &done)) {
             :     // ignore stray exec reports if we're already done, otherwise we lose
             :     // track of num_remaining_backends_
We've got two 'done' concepts here that are close enough to make this a bit confusing. The first is whether the backend was already done *before* this report (return value), the second is whether the backend is done *after* this report is applied (output param). 

It seems to me that we could remove the output parameter. The idea is to have ApplyExecStatusReport() return true iff this is the report that switched the BE state to 'done'. Then I think we can clean the logic here up a bit:

  1. UpdateInsertExecStatus()
  2. if (ApplyExecStatusReport()) { // we are report that finished this 
  backend state
  2a  UpdateStatus() // Only need to do if AESR() == true, otherwise we 
  are not first bad status that this BE has reported
  2b. --num_remaining_backends etc. 
  3 } else { if (returned_all_results_) return Status::Cancelled(); }
  4 return Status::OK();


-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


Patch Set 1:

> Does this trigger only when there are two concurrent calls to
 > UpdateBackendExecStatus() from the same backend? If so, do we
 > understand why that happens so often?

My understanding is this:
A fragment instance sends reports every 'n' seconds. Due to a congested network, two of these reports for the same fragment instance from a backend can arrive at the coordinator and start being processed at around the same time, hence leading to this issue.

Ideally a second report cannot be send until the first one is ACKd by the coordinator, since a lock is held until the report is ACKd, in the ReportProfileThread(); but there is only one case where a second report will be sent before the first one is responded to, i.e.  from FragmentInstanceState::Finalize().

So ReportProfileThread() sends the one report of the last finstance, then Finalize() sends the second report of the same finstance before the first one is responded to.

-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Sailesh Mukil,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7577

to look at the new patch set (#3).

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................

IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'

In Coordinator::UpdateBackendExecStatus(), we check if the backend
has already completed with BackendState::IsDone() and return without
applying the update if so to avoid updating num_remaining_backends_
twice for the same completed backend.

The problem is that the value of BackendState::IsDone() is updated by
the call to BackendState::ApplyExecStatusReport() that comes after it,
but these operations are not performed atomically, so if there are
two simultaneous calls to UpdateBackendExecStatus(), they can both
call IsDone(), both get 'false', and then proceed to erroneously both
update num_remaining_backends_, hitting a DCHECK.

This patch modifies ApplyExecStatusReport to return true iff this
report transitioned the backend to a done status, and then only
updates num_remaining_backends_ in this case, ensuring it is only
updated once per backend.

Testing:
- Ran test_finst_cancel_when_query_complete 10,000 times without
  hitting the DCHECK (previously, it would hit about once per 300
  runs).

Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
3 files changed, 31 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/7577/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


Patch Set 4:

(4 comments)

> (3 comments)
 > 
 > Thanks - this seems a lot better to me. Did you re-run
 > test_finst_cancel?

Yes, I ran it in a loop again.

http://gerrit.cloudera.org:8080/#/c/7577/2/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

PS2, Line 947: 
             :     // for now, abort the query if we see any error except if returned_all_results_ is
             :     // true (UpdateStatus() initiates cancellation, if it hasn't already been)
             :     // TODO: clarify control flow here,
> We've got two 'done' concepts here that are close enough to make this a bit
Done


http://gerrit.cloudera.org:8080/#/c/7577/3/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

PS3, Line 971:                  << backend_state->impalad_address();
             :           break;
             :         }
             :       }
> not your change, but while we're here: can you hoist the VLOG_QUERY_IS_ON c
Done


Line 987: 
> might be clearer just to return Status::OK() here, and avoid the need for a
Done


PS3, Line 991: nsert_exec_status.per_pa
> let's remove this. I can see the race where IsDone() becomes true after lin
I filed IMPALA-5807


-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Hello Henry Robinson, Michael Ho, Sailesh Mukil,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7577

to look at the new patch set (#4).

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................

IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'

In Coordinator::UpdateBackendExecStatus(), we check if the backend
has already completed with BackendState::IsDone() and return without
applying the update if so to avoid updating num_remaining_backends_
twice for the same completed backend.

The problem is that the value of BackendState::IsDone() is updated by
the call to BackendState::ApplyExecStatusReport() that comes after it,
but these operations are not performed atomically, so if there are
two simultaneous calls to UpdateBackendExecStatus(), they can both
call IsDone(), both get 'false', and then proceed to erroneously both
update num_remaining_backends_, hitting a DCHECK.

This patch modifies ApplyExecStatusReport to return true iff this
report transitioned the backend to a done status, and then only
updates num_remaining_backends_ in this case, ensuring it is only
updated once per backend.

Testing:
- Ran test_finst_cancel_when_query_complete 10,000 times without
  hitting the DCHECK (previously, it would hit about once per 300
  runs).

Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
3 files changed, 34 insertions(+), 34 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/7577/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


Patch Set 1:

Does this trigger only when there are two concurrent calls to UpdateBackendExecStatus() from the same backend? If so, do we understand why that happens so often?

-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7577/4/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

Line 113:   bool IsDone();
does it still make sense to expose that?  as this bug shows, seems better to not expose it outside this class if there's a good refactoring of the code that allows that.


-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


Patch Set 2:

(1 comment)

> > Does this trigger only when there are two concurrent calls to
 > > UpdateBackendExecStatus() from the same backend? If so, do we
 > > understand why that happens so often?
 > 
 > My understanding is this:
 > A fragment instance sends reports every 'n' seconds. Due to a
 > congested network, two of these reports for the same fragment
 > instance from a backend can arrive at the coordinator and start
 > being processed at around the same time, hence leading to this
 > issue.
 > 
 > Ideally a second report cannot be send until the first one is ACKd
 > by the coordinator, since a lock is held until the report is ACKd,
 > in the ReportProfileThread(); but there is only one case where a
 > second report will be sent before the first one is responded to,
 > i.e.  from FragmentInstanceState::Finalize().
 > 
 > So ReportProfileThread() sends the one report of the last
 > finstance, then Finalize() sends the second report of the same
 > finstance before the first one is responded to.

This may be possible, but the query I've been using to repro it runs fast enough that there are never any updates sent by the report thread, just from Finalize, so I haven't observed it.

The reason I've been seeing it fail is: there are two different fragments from the same backend that send reports at the same time. When they both reach Coordinator::UpdateBackendExecStatus(), they both get past the IsDone() check.

Then ApplyExecStatusReport() gets called twice but both calls return true in the out parameter 'done', because the first call contained a fragment with an error status which causes the backend to be considered done, so both threads end up decreasing num_remaining_backends_ for the same backend.

http://gerrit.cloudera.org:8080/#/c/7577/1/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

Line 75:   /// this update was not applied because this backend has previously completed. This can
> Please also add:
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


Patch Set 2: Code-Review+1

Carrying forward +1

-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


Patch Set 5: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: No