Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/03/02 02:02:00 UTC
[jira] [Commented] (IMPALA-9128) Improve debugging for slow sends in KrpcDataStreamSender
[ https://issues.apache.org/jira/browse/IMPALA-9128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499840#comment-17499840 ]
ASF subversion and git services commented on IMPALA-9128:
---------------------------------------------------------
Commit 719d3688950e9c22811ca2bb619b616f89679593 in impala's branch refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=719d368 ]
IMPALA-10049: Include RPC call_id in slow RPC logs
KRPC logs a slow RPC trace on the receiver side. That trace includes the
call_id, which identifies the same RPC on the sender. However, our slow
RPC logging on the sender side did not log this call_id, which made it
hard to associate the slow RPC logs between sender and receiver.
With the recent KRPC rebase in IMPALA-10931, we can now log the call_id
on the sender side as well.
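Conceptually, the fix just threads the KRPC call id into the sender-side log
message so it matches the receiver's rpcz trace. A minimal sketch, assuming a
hypothetical FormatSlowRpcLog helper (illustrative only, not Impala's actual
function or signature):

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper (illustrative, not Impala's API): builds a sender-side
// slow-RPC message that carries the KRPC call id, so the line can be matched
// against the receiver-side rpcz trace for the same call.
std::string FormatSlowRpcLog(const std::string& rpc_name, int64_t call_id,
                             const std::string& remote, double total_ms,
                             double receiver_ms) {
  std::ostringstream ss;
  ss << "Slow " << rpc_name << " RPC (request call id " << call_id << ") to "
     << remote << ": took " << total_ms << "ms. Receiver time: "
     << receiver_ms << "ms";
  return ss.str();
}
```

The resulting line has the same shape as the sender log shown below.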
Testing:
I tested this with a low threshold and added delays (the same setup as in
IMPALA-9128):
start-impala-cluster.py \
--impalad_args=--impala_slow_rpc_threshold_ms=1 \
--impalad_args=--debug_actions=END_DATA_STREAM_DELAY:JITTER@3000@1.0
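The JITTER@3000@1.0 debug action above injects a random delay of up to 3000 ms
with probability 1.0. A rough sketch of those semantics (names and structure
are illustrative, not Impala's actual debug-action implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Illustrative sketch of a JITTER@<max_ms>@<probability> debug action:
// with the given probability, pick a uniform delay in [0, max_ms].
// Returns the delay to inject in milliseconds (0 if the action does not fire).
int64_t JitterDelayMs(int64_t max_ms, double probability, std::mt19937& rng) {
  std::uniform_real_distribution<double> coin(0.0, 1.0);
  if (coin(rng) > probability) return 0;
  std::uniform_int_distribution<int64_t> delay(0, max_ms);
  return delay(rng);
}
```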
The following shows how the logs look on the sender and receiver sides:
impalad_node1.INFO (sender):
I0217 10:29:36.278754 6606 krpc-data-stream-sender.cc:394] Slow TransmitData RPC (request call id 414) to 127.0.0.1:27002 (fragment_instance_id=d8453c2785c38df4:3473e28b00000041): took 343.279ms. Receiver time: 342.780ms Network time: 498.405us
impalad_node2.INFO (receiver):
I0217 10:29:36.278379 6775 rpcz_store.cc:269] Call impala.DataStreamService.TransmitData from 127.0.0.1:39702 (request call id 414) took 342ms. Trace:
I0217 10:29:36.278479 6775 rpcz_store.cc:270] 0217 10:29:35.935586 (+ 0us) impala-service-pool.cc:179] Inserting onto call queue
0217 10:29:36.277730 (+342144us) impala-service-pool.cc:278] Handling call
0217 10:29:36.277859 (+ 129us) krpc-data-stream-recvr.cc:397] Deserializing batch
0217 10:29:36.278330 (+ 471us) krpc-data-stream-recvr.cc:424] Enqueuing deserialized batch
0217 10:29:36.278369 (+ 39us) inbound_call.cc:171] Queueing success response
Metrics: {}
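Since both lines now carry "request call id", correlating a slow send with the
matching receiver trace is just a matter of extracting that id from each side.
A small sketch (ExtractCallId is a hypothetical helper, not part of Impala):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Illustrative helper: extract the numeric call id from a log line containing
// "(request call id <N>)", returning -1 if the marker is absent. std::stoll
// stops at the first non-digit character, e.g. the closing parenthesis.
int64_t ExtractCallId(const std::string& log_line) {
  const std::string marker = "request call id ";
  size_t pos = log_line.find(marker);
  if (pos == std::string::npos) return -1;
  return std::stoll(log_line.substr(pos + marker.size()));
}
```

Applied to the sender and receiver lines above, both yield call id 414.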
Change-Id: I7fb5746fa0be575745a8e168405d43115c425389
Reviewed-on: http://gerrit.cloudera.org:8080/18243
Reviewed-by: Wenzhe Zhou <wz...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
> Improve debugging for slow sends in KrpcDataStreamSender
> --------------------------------------------------------
>
> Key: IMPALA-9128
> URL: https://issues.apache.org/jira/browse/IMPALA-9128
> Project: IMPALA
> Issue Type: Bug
> Components: Distributed Exec
> Reporter: Tim Armstrong
> Assignee: Tim Armstrong
> Priority: Major
> Labels: observability
> Fix For: Impala 3.4.0
>
>
> I'm trying to debug a problem that appears to be caused by a slow RPC:
> {noformat}
> Fragment F00
> Instance 754fc21ba4744310:d58fd04200000020 (host=xxxxxxxxx)
> Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1/120.48 MB
> - AverageThreadTokens: 1.00 (1.0)
> - BloomFilterBytes: 0 B (0)
> - InactiveTotalTime: 0ns (0)
> - PeakMemoryUsage: 3.2 MiB (3337546)
> - PeakReservation: 2.0 MiB (2097152)
> - PeakUsedReservation: 0 B (0)
> - PerHostPeakMemUsage: 6.7 MiB (6987376)
> - RowsProduced: 7 (7)
> - TotalNetworkReceiveTime: 0ns (0)
> - TotalNetworkSendTime: 3.6m (215354065071)
> - TotalStorageWaitTime: 4ms (4552708)
> - TotalThreadsInvoluntaryContextSwitches: 2 (2)
> - TotalThreadsTotalWallClockTime: 3.6m (215924079474)
> - TotalThreadsSysTime: 24ms (24386000)
> - TotalThreadsUserTime: 505ms (505714000)
> - TotalThreadsVoluntaryContextSwitches: 3,623 (3623)
> - TotalTime: 3.6m (215801961705)
> Fragment Instance Lifecycle Event Timeline
> Prepare Finished: 1ms (1812344)
> Open Finished: 322ms (322905753)
> First Batch Produced: 447ms (447050377)
> First Batch Sent: 447ms (447054546)
> ExecInternal Finished: 3.6m (215802284852)
> Buffer pool
> - AllocTime: 0ns (0)
> - CumulativeAllocationBytes: 0 B (0)
> - CumulativeAllocations: 0 (0)
> - InactiveTotalTime: 0ns (0)
> - PeakReservation: 0 B (0)
> - PeakUnpinnedBytes: 0 B (0)
> - PeakUsedReservation: 0 B (0)
> - ReadIoBytes: 0 B (0)
> - ReadIoOps: 0 (0)
> - ReadIoWaitTime: 0ns (0)
> - ReservationLimit: 0 B (0)
> - TotalTime: 0ns (0)
> - WriteIoBytes: 0 B (0)
> - WriteIoOps: 0 (0)
> - WriteIoWaitTime: 0ns (0)
> Fragment Instance Lifecycle Timings
> - ExecTime: 3.6m (215479380267)
> - ExecTreeExecTime: 124ms (124299400)
> - InactiveTotalTime: 0ns (0)
> - OpenTime: 321ms (321088906)
> - ExecTreeOpenTime: 572.04us (572045)
> - PrepareTime: 1ms (1426412)
> - ExecTreePrepareTime: 233.32us (233318)
> - TotalTime: 0ns (0)
> KrpcDataStreamSender (dst_id=3)
> - EosSent: 58 (58)
> - InactiveTotalTime: 3.6m (215354085858)
> - PeakMemoryUsage: 464.4 KiB (475504)
> - RowsSent: 7 (7)
> - RpcFailure: 0 (0)
> - RpcRetry: 0 (0)
> - SerializeBatchTime: 99.87us (99867)
> - TotalBytesSent: 207 B (207)
> - TotalTime: 3.6m (215355336381)
> - UncompressedRowBatchSize: 267 B (267)
> {noformat}
> We should add diagnostics that let us figure out which RPCs are slow and whether there is a pattern to which host is the problem, e.g. log whenever an RPC takes longer than a configurable threshold.
> It may also be useful to include some stats about the wait time, e.g. a histogram of wait times, so that we can see whether a slow send is an outlier or part of general slowness.
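The histogram suggestion in the issue could be as simple as bucketing send-wait
times by powers of two. A sketch under that assumption (this is not Impala's
actual HistogramMetric; the class and method names are hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative power-of-two bucketed histogram of RPC wait times, to help
// tell an outlier apart from general slowness. Bucket 0 covers [0, 2) ms;
// bucket i (i >= 1) covers [2^i, 2^(i+1)) ms; the last bucket absorbs the
// overflow.
class WaitTimeHistogram {
 public:
  explicit WaitTimeHistogram(int num_buckets) : buckets_(num_buckets, 0) {}

  void Record(int64_t wait_ms) {
    int bucket = 0;
    while (wait_ms > 1 && bucket + 1 < static_cast<int>(buckets_.size())) {
      wait_ms >>= 1;
      ++bucket;
    }
    ++buckets_[bucket];
  }

  int64_t BucketCount(int i) const { return buckets_[i]; }

 private:
  std::vector<int64_t> buckets_;
};
```

A 3.6-minute wait like the one in the profile above would land in a high
bucket by itself, making the outlier visible at a glance.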
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org