You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Pranay Singh (Code Review)" <ge...@cloudera.org> on 2018/07/23 21:52:17 UTC

[Impala-ASF-CR] IMPALA-6214: Determine and warn about stuck fragment instances.

Pranay Singh has uploaded this change for review. ( http://gerrit.cloudera.org:8080/11021


Change subject: IMPALA-6214: Determine and warn about stuck fragment instances.
......................................................................

IMPALA-6214: Determine and warn about stuck fragment instances.

In order to diagnose query hangs, we need to know the fragment execution
time on a particular exec node. Inspecting the query run time profile
to find the cause of hang does not give much details.

This change helps in finding the problematic 'exec node' where the fragment
execution is not making progress. The change makes use of kudu watchdog that
periodically polls and prints the delay in response from an exec node.

Testing:
--------
a) Added a delay on the sender side as a part of manual test case to notice
   the affect of change. The watchdog prints the detail of fragmentID and
   nodeID when the watchdog timer expires.

b) Ran the core test without failure.

Change-Id: I260a1d0a3477e5c6a46094e664500c3e2ed7de62
---
M be/src/common/global-flags.cc
M be/src/runtime/krpc-data-stream-recvr.cc
2 files changed, 27 insertions(+), 1 deletion(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/21/11021/1
-- 
To view, visit http://gerrit.cloudera.org:8080/11021
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I260a1d0a3477e5c6a46094e664500c3e2ed7de62
Gerrit-Change-Number: 11021
Gerrit-PatchSet: 1
Gerrit-Owner: Pranay Singh

[Impala-ASF-CR] IMPALA-6214: Determine and warn about stuck fragment instances.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11021 )

Change subject: IMPALA-6214: Determine and warn about stuck fragment instances.
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/17/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. This is experimental - please report any issues to tarmstrong@cloudera.com or on this JIRA: https://issues.apache.org/jira/browse/IMPALA-7317


-- 
To view, visit http://gerrit.cloudera.org:8080/11021
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I260a1d0a3477e5c6a46094e664500c3e2ed7de62
Gerrit-Change-Number: 11021
Gerrit-PatchSet: 1
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 23 Jul 2018 22:41:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6214: Determine and warn about stuck fragment instances.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has abandoned this change. ( http://gerrit.cloudera.org:8080/11021 )

Change subject: IMPALA-6214: Determine and warn about stuck fragment instances.
......................................................................


Abandoned

Appears stale, reopen if you disagree
-- 
To view, visit http://gerrit.cloudera.org:8080/11021
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I260a1d0a3477e5c6a46094e664500c3e2ed7de62
Gerrit-Change-Number: 11021
Gerrit-PatchSet: 2
Gerrit-Owner: Pranay Singh (320)
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pranay Singh (320)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] IMPALA-6214: Determine and warn about stuck fragment instances.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11021 )

Change subject: IMPALA-6214: Determine and warn about stuck fragment instances.
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/43/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11021
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I260a1d0a3477e5c6a46094e664500c3e2ed7de62
Gerrit-Change-Number: 11021
Gerrit-PatchSet: 2
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 24 Jul 2018 23:40:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6214: Determine and warn about stuck fragment instances.

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Hello Todd Lipcon, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11021

to look at the new patch set (#2).

Change subject: IMPALA-6214: Determine and warn about stuck fragment instances.
......................................................................

IMPALA-6214: Determine and warn about stuck fragment instances.

In order to diagnose query hangs, we need to know the fragment execution
time on a particular exec node. Inspecting the query run time profile
to find the cause of hang does not give much details.

This change helps in finding the problematic 'exec node' where the fragment
execution is not making progress. The change makes use of timed-wait wait
that waits for the 1 min delay if the batch request is not recieved from
the sender node.

Testing:
--------
a) Added a delay on the sender side as a part of manual test case to notice
   the affect of change. The watchdog prints the detail of fragmentID and
   nodeID when the watchdog timer expires.

b) Ran the core test without failure.

Change-Id: I260a1d0a3477e5c6a46094e664500c3e2ed7de62
---
M be/src/runtime/krpc-data-stream-recvr.cc
1 file changed, 7 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/21/11021/2
-- 
To view, visit http://gerrit.cloudera.org:8080/11021
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I260a1d0a3477e5c6a46094e664500c3e2ed7de62
Gerrit-Change-Number: 11021
Gerrit-PatchSet: 2
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] IMPALA-6214: Determine and warn about stuck fragment instances.

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/11021 )

Change subject: IMPALA-6214: Determine and warn about stuck fragment instances.
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11021/2/be/src/runtime/krpc-data-stream-recvr.cc
File be/src/runtime/krpc-data-stream-recvr.cc:

http://gerrit.cloudera.org:8080/#/c/11021/2/be/src/runtime/krpc-data-stream-recvr.cc@238
PS2, Line 238:       VLOG_QUERY << "wait arrival fragment_instance_id="
VLOG_QUERY is on by default, so this would become very very noisy, once per wait, no? I think we'd only want to log if we have hit a timeout from the below CV wait.

Also, I don't know the context of this code quite well enough, but isn't it normal to sometimes wait for minutes on a sender? For example, if the upstream node is a full sort, or a join with a lot of slow parents then the receiver side may block for minutes or even hours before making progress. In that case, I can see surfacing this kind of information in the profile or in some query-scoped log but maybe not in the global impalad log?

I think I don't quite understand the end goal of this JIRA well enough to evaluate whether this change is a net help. Why doesn't the existing data_wait_timer already tell us that this node is the blocking culprit, and from there we can just look at the fragment graph to understand what was blocking it?



-- 
To view, visit http://gerrit.cloudera.org:8080/11021
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I260a1d0a3477e5c6a46094e664500c3e2ed7de62
Gerrit-Change-Number: 11021
Gerrit-PatchSet: 2
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 24 Jul 2018 23:36:23 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6214: Determine and warn about stuck fragment instances.

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/11021 )

Change subject: IMPALA-6214: Determine and warn about stuck fragment instances.
......................................................................


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/11021/1/be/src/runtime/krpc-data-stream-recvr.cc
File be/src/runtime/krpc-data-stream-recvr.cc:

http://gerrit.cloudera.org:8080/#/c/11021/1/be/src/runtime/krpc-data-stream-recvr.cc@230
PS1, Line 230:   stringstream dstid;
             : 
             :   dstid << recvr_->dest_node_id();
             :   string frgId = PrintId(recvr_->fragment_instance_id());
             :   string watchdog_str = "fragment_instance_id= " + frgId + " node= " + dstid.str();
             :   memset(buf, BUFSZ, 0);
             :   sprintf(buf, "%s : %d  %s", __FILE__, __LINE__, watchdog_str.c_str());
> constructing this once per batch seems like a big performance red flag.
I'll remove it.


http://gerrit.cloudera.org:8080/#/c/11021/1/be/src/runtime/krpc-data-stream-recvr.cc@239
PS1, Line 239:   kudu::SetStackTraceSignal(SIGRTMIN + 1);
> you should call this at init time, not once per batch. I'm not even sure it
I'll remove it.


http://gerrit.cloudera.org:8080/#/c/11021/1/be/src/runtime/krpc-data-stream-recvr.cc@266
PS1, Line 266:       data_arrival_cv_.wait(l);
> maybe it would be simpler to just use a timed-wait variant here to periodic
Thanks Todd I was thinking along the same lines, it's a good and easy fix for this case.



-- 
To view, visit http://gerrit.cloudera.org:8080/11021
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I260a1d0a3477e5c6a46094e664500c3e2ed7de62
Gerrit-Change-Number: 11021
Gerrit-PatchSet: 1
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 24 Jul 2018 20:14:24 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6214: Determine and warn about stuck fragment instances.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11021 )

Change subject: IMPALA-6214: Determine and warn about stuck fragment instances.
......................................................................


Patch Set 2:

Build Started https://jenkins.impala.io/job/gerrit-code-review-checks/43/ 

Running initial code review checks. This is experimental - please report any issues to tarmstrong@cloudera.com or on this JIRA: IMPALA-7317


-- 
To view, visit http://gerrit.cloudera.org:8080/11021
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I260a1d0a3477e5c6a46094e664500c3e2ed7de62
Gerrit-Change-Number: 11021
Gerrit-PatchSet: 2
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 24 Jul 2018 23:13:22 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6214: Determine and warn about stuck fragment instances.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11021 )

Change subject: IMPALA-6214: Determine and warn about stuck fragment instances.
......................................................................


Patch Set 1:

Build Started https://jenkins.impala.io/job/gerrit-code-review-checks/17/ 

Running initial code review checks.


-- 
To view, visit http://gerrit.cloudera.org:8080/11021
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I260a1d0a3477e5c6a46094e664500c3e2ed7de62
Gerrit-Change-Number: 11021
Gerrit-PatchSet: 1
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 23 Jul 2018 21:52:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6214: Determine and warn about stuck fragment instances.

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/11021 )

Change subject: IMPALA-6214: Determine and warn about stuck fragment instances.
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11021/2/be/src/runtime/krpc-data-stream-recvr.cc
File be/src/runtime/krpc-data-stream-recvr.cc:

http://gerrit.cloudera.org:8080/#/c/11021/2/be/src/runtime/krpc-data-stream-recvr.cc@238
PS2, Line 238:       VLOG_QUERY << "wait arrival fragment_instance_id="
> VLOG_QUERY is on by default, so this would become very very noisy, once per
I don't think this is the right place for this check. We should instead perform this in or around the loop in FragmentInstanceState::ExecInternal(). We need to come up with a way to expose this on a per-query basis, either on a status page or (preferably) in the profile, so that other tools can parse this more easily. In addition, we should think about how we can prevent false posiitives, i.e. only report an instance that blocks, but is not blocked by any of its inputs.



-- 
To view, visit http://gerrit.cloudera.org:8080/11021
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I260a1d0a3477e5c6a46094e664500c3e2ed7de62
Gerrit-Change-Number: 11021
Gerrit-PatchSet: 2
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 31 Jul 2018 17:46:36 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6214: Determine and warn about stuck fragment instances.

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/11021 )

Change subject: IMPALA-6214: Determine and warn about stuck fragment instances.
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/11021/1/be/src/runtime/krpc-data-stream-recvr.cc
File be/src/runtime/krpc-data-stream-recvr.cc:

http://gerrit.cloudera.org:8080/#/c/11021/1/be/src/runtime/krpc-data-stream-recvr.cc@230
PS1, Line 230:   stringstream dstid;
             : 
             :   dstid << recvr_->dest_node_id();
             :   string frgId = PrintId(recvr_->fragment_instance_id());
             :   string watchdog_str = "fragment_instance_id= " + frgId + " node= " + dstid.str();
             :   memset(buf, BUFSZ, 0);
             :   sprintf(buf, "%s : %d  %s", __FILE__, __LINE__, watchdog_str.c_str());
constructing this once per batch seems like a big performance red flag.

Why not construct this string in the KrpcDataStreamRecvr? I don't know this code super well, but I thought this was constant for a given instance.

That said, I also don't know if this is even a safe usage of ScopedWatchKernelStack -- I believe it relies on its argument not ever getting freed, because there is no synchronization that ensures that the "watcher" thread doesn't read this pointer after the "watched" thread exits. In other words, I think you might have a race here where the watcher prints arbitrary junk off of the target thread's stack.


http://gerrit.cloudera.org:8080/#/c/11021/1/be/src/runtime/krpc-data-stream-recvr.cc@239
PS1, Line 239:   kudu::SetStackTraceSignal(SIGRTMIN + 1);
you should call this at init time, not once per batch. I'm not even sure it's thread-safe, and if it is, it probably takes a lock so this would collapse throughput.



-- 
To view, visit http://gerrit.cloudera.org:8080/11021
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I260a1d0a3477e5c6a46094e664500c3e2ed7de62
Gerrit-Change-Number: 11021
Gerrit-PatchSet: 1
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 23 Jul 2018 23:56:24 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6214: Determine and warn about stuck fragment instances.

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/11021 )

Change subject: IMPALA-6214: Determine and warn about stuck fragment instances.
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11021/1/be/src/runtime/krpc-data-stream-recvr.cc
File be/src/runtime/krpc-data-stream-recvr.cc:

http://gerrit.cloudera.org:8080/#/c/11021/1/be/src/runtime/krpc-data-stream-recvr.cc@266
PS1, Line 266:       data_arrival_cv_.wait(l);
maybe it would be simpler to just use a timed-wait variant here to periodically log if something gets stuck for a long time?



-- 
To view, visit http://gerrit.cloudera.org:8080/11021
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I260a1d0a3477e5c6a46094e664500c3e2ed7de62
Gerrit-Change-Number: 11021
Gerrit-PatchSet: 1
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 24 Jul 2018 20:06:51 +0000
Gerrit-HasComments: Yes