You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Henry Robinson (Code Review)" <ge...@cloudera.org> on 2016/10/27 05:11:41 UTC

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Henry Robinson has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/4865

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................

IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

This patch fixes two problems:

1. If a fragment instance is cancelled (by the coordinator) concurrently
to the coordinator calling GetNext() on its PlanRootSink, GetNext() may
access a destroyed PlanRootSink object and crash (IMPALA-4333).

2. If ExecPlanFragment() times out, but is successful, the coordinator
will never call CloseConsumer() to release its side of the PlanRootSink,
and the fragment instance will never finish (IMPALA-4348).

The following changes address this:

* Make PlanFragmentExecutor::Cancel() call CloseConsumer() on its root
  sink. This ensures that, no matter what the reason, the fragment
  instance will eventually terminate.

* Ensure that it is safe to call CloseConsumer() concurrently to
  PlanRootSink::GetNext().

* Ensure that the PlanRootSink has a lifetime at least as long as the
  coordinator object by taking a shared_ptr reference to its parent
  FragmentExecState object. Even if the fragment instance finishes, the
  object will still be valid until the coordinator releases this
  reference. In the future, we expect this to be replaced by an explicit
  reference take / relinquish API.

* Ensure the coordinator tries to cancel fragment instances if any
  attempt was made to start them, not just if the RPC was a
  success. This deals with the timeout case where the RPC was slow, but
  successful.

This patch relies on a fix for IMPALA-4383, which ensures that fragment
instances will always try to send reports, and so will always detect an
error and cancel themselves.

Testing: Ran TestRPCTimeout.test_execplanfragment_timeout in a loop for
90 minutes. Patch addresses hangs and failure modes that were observed
previously.

Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
---
M be/src/exec/plan-root-sink.cc
M be/src/exec/plan-root-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/plan-fragment-executor.cc
5 files changed, 37 insertions(+), 23 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/4865/1
-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/exec/plan-root-sink.cc
File be/src/exec/plan-root-sink.cc:

Line 163:     consumer_cv_.wait(l);
> I guess I don't understand why the changes to PlanRootSink are required the
Agree it's not needed for correctness. consumer_cv_wait_ can wake up spuriously, so it could be worth checking here.

But perhaps CloseConsumer() should also signal consumer_cv_.


-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has uploaded a new patch set (#2).

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................

IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

This patch fixes two problems:

1. If a fragment instance is cancelled (by the coordinator) concurrently
to the coordinator calling GetNext() on its PlanRootSink, GetNext() may
access a destroyed PlanRootSink object and crash (IMPALA-4333).

2. If ExecPlanFragment() times out, but is successful, the coordinator
will never call CloseConsumer() to release its side of the PlanRootSink,
and the fragment instance will never finish (IMPALA-4348).

The following changes address this:

* Make PlanFragmentExecutor::Cancel() call CloseConsumer() on its root
  sink. This ensures that, no matter what the reason, the fragment
  instance will eventually terminate.

* Ensure that it is safe to call CloseConsumer() concurrently to
  PlanRootSink::GetNext().

* Ensure that the PlanRootSink has a lifetime at least as long as the
  coordinator object by taking a shared_ptr reference to its parent
  FragmentExecState object. Even if the fragment instance finishes, the
  object will still be valid until the coordinator releases this
  reference. In the future, we expect this to be replaced by an explicit
  reference take / relinquish API.

* Ensure the coordinator tries to cancel fragment instances if any
  attempt was made to start them, not just if the RPC was a
  success. This deals with the timeout case where the RPC was slow, but
  successful.

This patch relies on a fix for IMPALA-4383, which ensures that fragment
instances will always try to send reports, and so will always detect an
error and cancel themselves.

Testing: Ran TestRPCTimeout.test_execplanfragment_timeout in a loop for
90 minutes. Patch addresses hangs and failure modes that were observed
previously.

Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
---
M be/src/exec/plan-root-sink.cc
M be/src/exec/plan-root-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/plan-fragment-executor.cc
5 files changed, 44 insertions(+), 25 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/4865/2
-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/exec/plan-root-sink.cc
File be/src/exec/plan-root-sink.cc:

Line 163:     consumer_cv_.wait(l);
As far as I can see, PlanRootSink can still be destroyed while a thread A is stuck in here.

If CloseConsumer() is called concurrently by thread B while the consumer thread is waiting here, consumer_done_ is set to true, and thread A remains blocked here.

Then if Close() is called by a different thread C, consumer_done_ is true so it returns immediately and can start destroying this. Close() signals 'consumer_cv_' so this thread might get luck and get out of here in time, but I don't see anything that guarantees that.


-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/exec/plan-root-sink.cc
File be/src/exec/plan-root-sink.cc:

Line 163:     consumer_cv_.wait(l);
> Now that CloseConsumer() signals consumer_cv_ (which I think is an appropri
Sorry, old comment - superseded by the code.


-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has submitted this change and it was merged.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

This patch fixes two problems:

1. If a fragment instance is cancelled (by the coordinator) concurrently
to the coordinator calling GetNext() on its PlanRootSink, GetNext() may
access a destroyed PlanRootSink object and crash (IMPALA-4333).

2. If ExecPlanFragment() times out, but is successful, the coordinator
will never call CloseConsumer() to release its side of the PlanRootSink,
and the fragment instance will never finish (IMPALA-4348).

The following changes address this:

* Make PlanFragmentExecutor::Cancel() call CloseConsumer() on its root
  sink. This ensures that, no matter what the reason, the fragment
  instance will eventually terminate. Otherwise the coordinator may not
  call CloseConsumer() if it could not access the root instance's
  FragmentExecState.

* Ensure that the PlanRootSink has a lifetime at least as long as the
  coordinator object by taking a shared_ptr reference to its parent
  FragmentExecState object. Even if the fragment instance finishes, the
  object will still be valid until the coordinator releases this
  reference. In the future, we expect this to be replaced by an explicit
  reference take / relinquish API.

* Ensure the coordinator tries to cancel fragment instances if any
  attempt was made to start them, not just if the RPC was a
  success. This deals with the timeout case where the RPC was slow, but
  successful.

This patch relies on a fix for IMPALA-4383, which ensures that fragment
instances will always try to send reports, and so will always detect an
error and cancel themselves.

Testing:

* Ran TestRPCTimeout.test_execplanfragment_timeout in a loop for 90
minutes. Patch addresses hangs and failure modes that were observed
previously.

* Added a manual sleep just before Coordinator::GetNext() calls
  root_sink_->GetNext(), and cancelled a query manually in that
  window. Confirmed that query cancels successfully.

Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Reviewed-on: http://gerrit.cloudera.org:8080/4865
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <he...@cloudera.com>
---
M be/src/exec/plan-root-sink.cc
M be/src/exec/plan-root-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/plan-fragment-executor.cc
5 files changed, 65 insertions(+), 35 deletions(-)

Approvals:
  Henry Robinson: Looks good to me, approved
  Internal Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/exec/plan-root-sink.cc
File be/src/exec/plan-root-sink.cc:

Line 156:   if (consumer_done_) return state->CheckQueryState();
> sure, fine to leave out now, but please change in a follow-up patch.
For me, it's less comprehensible to have this check because it is racy -- the lock is dropped at line 163 -- and so consumer_done_ can get set in the middle of this routine anyway.  We can keep the consumer_done_ check on line 162 to deal with that, but then we have two possible tear-down sequences rather than one (one where we notice sender_done_ is set and the other when we notice consumer_done_ is set), so to me that harder to reason about (and get test coverage for both).


-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4865

to look at the new patch set (#6).

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................

IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

This patch fixes two problems:

1. If a fragment instance is cancelled (by the coordinator) concurrently
to the coordinator calling GetNext() on its PlanRootSink, GetNext() may
access a destroyed PlanRootSink object and crash (IMPALA-4333).

2. If ExecPlanFragment() times out, but is successful, the coordinator
will never call CloseConsumer() to release its side of the PlanRootSink,
and the fragment instance will never finish (IMPALA-4348).

The following changes address this:

* Make PlanFragmentExecutor::Cancel() call CloseConsumer() on its root
  sink. This ensures that, no matter what the reason, the fragment
  instance will eventually terminate. Otherwise the coordinator may not
  call CloseConsumer() if it could not access the root instance's
  FragmentExecState.

* Ensure that the PlanRootSink has a lifetime at least as long as the
  coordinator object by taking a shared_ptr reference to its parent
  FragmentExecState object. Even if the fragment instance finishes, the
  object will still be valid until the coordinator releases this
  reference. In the future, we expect this to be replaced by an explicit
  reference take / relinquish API.

* Ensure the coordinator tries to cancel fragment instances if any
  attempt was made to start them, not just if the RPC was a
  success. This deals with the timeout case where the RPC was slow, but
  successful.

This patch relies on a fix for IMPALA-4383, which ensures that fragment
instances will always try to send reports, and so will always detect an
error and cancel themselves.

Testing:

* Ran TestRPCTimeout.test_execplanfragment_timeout in a loop for 90
minutes. Patch addresses hangs and failure modes that were observed
previously.

* Added a manual sleep just before Coordinator::GetNext() calls
  root_sink_->GetNext(), and cancelled a query manually in that
  window. Confirmed that query cancels successfully.

Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
---
M be/src/exec/plan-root-sink.cc
M be/src/exec/plan-root-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/plan-fragment-executor.cc
5 files changed, 65 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/4865/6
-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/exec/plan-root-sink.cc
File be/src/exec/plan-root-sink.cc:

Line 163:     consumer_cv_.wait(l);
> As far as I can see, PlanRootSink can still be destroyed while a thread A i
You're right, but the lifetime of the PRS is managed by the consumer and the sender, which coordinate out-of-scope here. 

In this patch we do that via the shared_ptr to the fragment instance.


http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

Line 520:       return Status(Substitute("Root fragment instance ($0) failed", PrintId(query_id_)));
> would the query ultimately fail with this status or with the status returne
The query will fail with this error message. We're in a tricky position here - do we wait for an error report from the fragment, or do we return quickly? Waiting seems a bad idea to me.

What we could do is try and read the status of the root fragment - if it is in error, we'll return it, otherwise we'll return the more generic message.


Line 1196:     // CloseConsumer() here). See IMPALA-4275 for details.
> is this still accurate? if we moved this, would that eliminate the need to 
TearDown() also calls CloseConsumer() - so I think that moving this call wouldn't change whether we need PFE::Cancel() to call it as well.


http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/runtime/plan-fragment-executor.cc
File be/src/runtime/plan-fragment-executor.cc:

Line 509:   }
> nit:could be one line
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 4: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

PS1, Line 519: !=
note: this is obviously wrong; didn't properly stage that fix in this patchset.


-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/exec/plan-root-sink.cc
File be/src/exec/plan-root-sink.cc:

Line 163:     consumer_cv_.wait(l);
> You're right, but the lifetime of the PRS is managed by the consumer and th
I guess I don't understand why the changes to PlanRootSink are required then.

It seems like ultimately we're depending on Close() waking up this thread anyway after it sets sender_done_, so why do we need to check consumer_done_ in this loop?


-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Internal Jenkins (Code Review)" <ge...@cloudera.org>.
Internal Jenkins has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 5: Verified-1

Build failed: http://sandbox.jenkins.cloudera.com/job/impala-external-gerrit-verify-merge-ASF/375/

-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has uploaded a new patch set (#3).

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................

IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

This patch fixes two problems:

1. If a fragment instance is cancelled (by the coordinator) concurrently
to the coordinator calling GetNext() on its PlanRootSink, GetNext() may
access a destroyed PlanRootSink object and crash (IMPALA-4333).

2. If ExecPlanFragment() times out, but is successful, the coordinator
will never call CloseConsumer() to release its side of the PlanRootSink,
and the fragment instance will never finish (IMPALA-4348).

The following changes address this:

* Make PlanFragmentExecutor::Cancel() call CloseConsumer() on its root
  sink. This ensures that, no matter what the reason, the fragment
  instance will eventually terminate. Otherwise the coordinator may not
  call CloseConsumer() if it could not access the root instance's
  FragmentExecState.

* Ensure that the PlanRootSink has a lifetime at least as long as the
  coordinator object by taking a shared_ptr reference to its parent
  FragmentExecState object. Even if the fragment instance finishes, the
  object will still be valid until the coordinator releases this
  reference. In the future, we expect this to be replaced by an explicit
  reference take / relinquish API.

* Ensure the coordinator tries to cancel fragment instances if any
  attempt was made to start them, not just if the RPC was a
  success. This deals with the timeout case where the RPC was slow, but
  successful.

This patch relies on a fix for IMPALA-4383, which ensures that fragment
instances will always try to send reports, and so will always detect an
error and cancel themselves.

Testing:

* Ran TestRPCTimeout.test_execplanfragment_timeout in a loop for 90
minutes. Patch addresses hangs and failure modes that were observed
previously.

* Added a manual sleep just before Coordinator::GetNext() calls
  root_sink_->GetNext(), and cancelled a query manually in that
  window. Confirmed that query cancels successfully.

Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
---
M be/src/exec/plan-root-sink.cc
M be/src/exec/plan-root-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/plan-fragment-executor.cc
5 files changed, 62 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/4865/3
-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has uploaded a new patch set (#4).

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................

IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

This patch fixes two problems:

1. If a fragment instance is cancelled (by the coordinator) concurrently
to the coordinator calling GetNext() on its PlanRootSink, GetNext() may
access a destroyed PlanRootSink object and crash (IMPALA-4333).

2. If ExecPlanFragment() times out, but is successful, the coordinator
will never call CloseConsumer() to release its side of the PlanRootSink,
and the fragment instance will never finish (IMPALA-4348).

The following changes address this:

* Make PlanFragmentExecutor::Cancel() call CloseConsumer() on its root
  sink. This ensures that, no matter what the reason, the fragment
  instance will eventually terminate. Otherwise the coordinator may not
  call CloseConsumer() if it could not access the root instance's
  FragmentExecState.

* Ensure that the PlanRootSink has a lifetime at least as long as the
  coordinator object by taking a shared_ptr reference to its parent
  FragmentExecState object. Even if the fragment instance finishes, the
  object will still be valid until the coordinator releases this
  reference. In the future, we expect this to be replaced by an explicit
  reference take / relinquish API.

* Ensure the coordinator tries to cancel fragment instances if any
  attempt was made to start them, not just if the RPC was a
  success. This deals with the timeout case where the RPC was slow, but
  successful.

This patch relies on a fix for IMPALA-4383, which ensures that fragment
instances will always try to send reports, and so will always detect an
error and cancel themselves.

Testing:

* Ran TestRPCTimeout.test_execplanfragment_timeout in a loop for 90
minutes. Patch addresses hangs and failure modes that were observed
previously.

* Added a manual sleep just before Coordinator::GetNext() calls
  root_sink_->GetNext(), and cancelled a query manually in that
  window. Confirmed that query cancels successfully.

Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
---
M be/src/exec/plan-root-sink.cc
M be/src/exec/plan-root-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/plan-fragment-executor.cc
5 files changed, 60 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/4865/4
-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 1:

(Just flushing comments to keep discussion going, new patch coming soon).

-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/4865/2//COMMIT_MSG
Commit Message:

Line 13: access a destroyed PlanRootSink object and crash (IMPALA-4333).
> did this bug reproduce in any tests, or was it only caught by Hue?
Only caught by Hue. Needs some kind of... fault injection framework? That sounds like an idea.

I tested it manually.


Line 25: * Ensure that it is safe to call CloseConsumer() concurrently to
> If I understand correctly there weren't any correctness fixes in this part,
Done


http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/exec/plan-root-sink.cc
File be/src/exec/plan-root-sink.cc:

Line 163: 
> Yeah, I think we should be clearer what's necessary for correctness and wha
Now that CloseConsumer() signals consumer_cv_ (which I think is an appropriate functional change), I think this is relevant to the patch (otherwise I agree, we should remove it).


http://gerrit.cloudera.org:8080/#/c/4865/2/be/src/exec/plan-root-sink.h
File be/src/exec/plan-root-sink.h:

Line 79:   /// that calls Send(). *eos is set to 'true' when there are no more rows to consume.
> When does it block/unblock?
Done


http://gerrit.cloudera.org:8080/#/c/4865/2/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

PS2, Line 166:  rpc_sent = true
> We don't seem to do anything with this argument.
Woops, thanks. It's ok for rpc_sent_ to be incorrectly set to true - it's not an error to try to cancel non-existant plan fragments - but it does prevent warnings in the logs.


PS2, Line 517: might have been cancelled
> what does it mean to "have been cancelled"?  Since line 478 gets the lock, 
Failed is probably better, yep.


Line 523:       // Try and return the fragment instance status if it was already set.
> Are there any situations where the fragment failed and this isn't set? Woul
Reporting the failure happens asynchronously - I think it's very unlikely to be an issue, but there's nothing guaranteeing that the error is visible to this thread even though the fragment has finished.


-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/exec/plan-root-sink.h
File be/src/exec/plan-root-sink.h:

PS1, Line 84: after
the commit message indicates it can be called concurrently as well...


http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

Line 520:       return Status(Substitute("Root fragment instance ($0) failed", PrintId(query_id_)));
would the query ultimately fail with this status or with the status returned by the remote Prepare() failure?  Is this a change of behavior and will it now be harder to know the real reason the query failed?


Line 525:     // fragment instance's executor will not complete until that point.
is this comment still accurate or does it need refining for the cancel case?


Line 1196:     // CloseConsumer() here). See IMPALA-4275 for details.
is this still accurate? if we moved this, would that eliminate the need to call CloseConsumer in PFE::Cancel()?


http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/runtime/plan-fragment-executor.cc
File be/src/runtime/plan-fragment-executor.cc:

Line 509:   }
nit:could be one line


-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 5:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/exec/plan-root-sink.cc
File be/src/exec/plan-root-sink.cc:

Line 156:   results_ = results;
> if someone called CloseConsumer(), does it even make sense to keep executin
We decided not to include this path for this patch - reason being that we want to keep the number of ways to exit this method small, to make sure we're getting good test coverage. Returning here has no correctness benefit.


http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

Line 1550
> leave a comment in place explaining why the cancel call is done despite a p
Done


Line 166:   /// CancelFragmentInstances() will include this instance in the set of potential
> remove default.
Done


Line 506:   }
> since they're all started at the same time now, why is the coordinator frag
The coordinator is the only one we need to worry about closing, so we don't need to do it ourselves. Updated comment.


Line 522:       FragmentInstanceState* root_state = fragment_instance_states_[0];
> get rid of executor_ and root_sink_ and add executor() and root_sink() that
Will do in follow-on patch - that has a couple of subtle effects that I don't want to worry about for this fix.


-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 2:

(6 comments)

Starting to feel a little more comfortable with it. Probably needs some more thought but I did a full pass.

http://gerrit.cloudera.org:8080/#/c/4865/2//COMMIT_MSG
Commit Message:

Line 25: * Ensure that it is safe to call CloseConsumer() concurrently to
If I understand correctly there weren't any correctness fixes in this part, just some things to wake up the GetNext() thread sooner, right?


http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/exec/plan-root-sink.cc
File be/src/exec/plan-root-sink.cc:

Line 163: 
> Agree it's not needed for correctness. consumer_cv_wait_ can wake up spurio
Yeah, I think we should be clearer what's necessary for correctness and what is an optimisation.


http://gerrit.cloudera.org:8080/#/c/4865/2/be/src/exec/plan-root-sink.h
File be/src/exec/plan-root-sink.h:

Line 79:   /// that calls Send(). *eos is set to 'true' when there are no more rows to consume.
When does it block/unblock?


http://gerrit.cloudera.org:8080/#/c/4865/2/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

PS2, Line 166:  rpc_sent = true
We don't seem to do anything with this argument.

Also, IMO probably better to make it a non-default argument since the default arg adds additional subtlety to something that's already fairly subtle.


Line 523:       // Try and return the fragment instance status if it was already set.
Are there any situations where the fragment failed and this isn't set? Would be good to be clear about that. It seems a little strange that it would disappear without leaving an error, but is that possible with cancellation?


http://gerrit.cloudera.org:8080/#/c/4865/2/be/src/runtime/plan-fragment-executor.cc
File be/src/runtime/plan-fragment-executor.cc:

Line 507:   if (root_sink_ != nullptr) root_sink_->CloseConsumer();
It's a little hard to understand why the producer-side is calling CloseConsumer() without reading the commit message - probably a comment is needed.


-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Internal Jenkins (Code Review)" <ge...@cloudera.org>.
Internal Jenkins has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 6: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 6: Code-Review+2

Carry +2

-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4865/3/be/src/exec/plan-root-sink.cc
File be/src/exec/plan-root-sink.cc:

Line 156:   if (consumer_done_) return state->CheckQueryState();
> is this needed? won't the sender_done_ flag also be set soon enough, and be
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4865/2/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

PS2, Line 517: might have been cancelled
what does it mean to "have been cancelled"?  Since line 478 gets the lock, we wouldn't have cancelled it yet.  Do you mean, "might have failed ..."?


-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/exec/plan-root-sink.cc
File be/src/exec/plan-root-sink.cc:

Line 156:   if (consumer_done_) return state->CheckQueryState();
if someone called CloseConsumer(), does it even make sense to keep executing this function? shouldn't you just return eos at this point?


Line 163:     consumer_cv_.wait(l);
> Agree it's not needed for correctness. consumer_cv_wait_ can wake up spurio
make that part of the declared interface (if you call CloseConsumer() then GetNext() will immediately exit with eos).


http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

Line 1550
leave a comment in place explaining why the cancel call is done despite a possible error status


Line 166:   void SetInitialStatus(const Status& status, bool rpc_sent = true) {
remove default.


Line 506:   // If there was an error, but the coordinator fragment was successfully started, it
since they're all started at the same time now, why is the coordinator fragment still special?


Line 522:     executor_ = root_fragment_instance_->executor();
get rid of executor_ and root_sink_ and add executor() and root_sink() that go through root_fragment_instance_, to make clear that those are only valid as long as root_fragment_instance_ is valid.


-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Internal Jenkins (Code Review)" <ge...@cloudera.org>.
Internal Jenkins has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 6: Verified-1

Build failed: http://sandbox.jenkins.cloudera.com/job/impala-external-gerrit-verify-merge-ASF/376/

-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4865/3/be/src/exec/plan-root-sink.cc
File be/src/exec/plan-root-sink.cc:

Line 156:   if (consumer_done_) return state->CheckQueryState();
is this needed? won't the sender_done_ flag also be set soon enough, and before consumer_cv_ is notified?


-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 2:

(1 comment)

I'm also not sure why we need the changes to plan-root-sink (other than removing the dcheck). let's chat in person.

http://gerrit.cloudera.org:8080/#/c/4865/2//COMMIT_MSG
Commit Message:

Line 13: access a destroyed PlanRootSink object and crash (IMPALA-4333).
did this bug reproduce in any tests, or was it only caught by Hue?


-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 4: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4865/2/be/src/runtime/plan-fragment-executor.cc
File be/src/runtime/plan-fragment-executor.cc:

Line 507:   if (root_sink_ != nullptr) root_sink_->CloseConsumer();
> It's a little hard to understand why the producer-side is calling CloseCons
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4865/1/be/src/exec/plan-root-sink.cc
File be/src/exec/plan-root-sink.cc:

Line 156:   if (consumer_done_) return state->CheckQueryState();
> We decided not to include this path for this patch - reason being that we w
sure, fine to leave out now, but please change in a follow-up patch.

there's no correctness benefit, but it has a comprehensibility benefit (because i don't have to reason about why exactly you have to run through all of this if you're already closed).


-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes