You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Yang Wang (Jira)" <ji...@apache.org> on 2020/02/06 03:00:00 UTC
[jira] [Commented] (FLINK-15924) Detect and log blocking main thread operations

    [ https://issues.apache.org/jira/browse/FLINK-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031219#comment-17031219 ] 

Yang Wang commented on FLINK-15924:
-----------------------------------

This changes is really on point. As you say, it is not easy to guarantee that all the I/O blocking operations are not executed in the main thread. So it will be good if we could add some {{WARN}} log for the long rpc call. Also we could add some RPC processing metrics. For example, in Hadoop Yarn Resourcemanager, it have the following logs and metrics.

 
{code:java}
2020-02-06 10:22:34,469 WARN [IPC Server handler 22 on 8030] org.apache.hadoop.ipc.Server: Slow RPC : allocate took 1 milliseconds to process from client Call#468249 Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from xx.xx.xx.xx:27064
{code}
 

{code:java}
name: "Hadoop:service=ResourceManager,name=RpcActivityForPort8025",
modelerType: "RpcActivityForPort8025",
tag.port: "8025",
tag.Context: "rpc",
tag.NumOpenConnectionsPerUser: "{"hadoop":17}",
tag.Hostname: "z05c19394.sqa.zth.tbsite.net",
RpcQueueTime10sNumOps: 103,
RpcQueueTime10s50thPercentileLatency: 1,
RpcQueueTime10s75thPercentileLatency: 3,
RpcQueueTime10s90thPercentileLatency: 4,
RpcQueueTime10s95thPercentileLatency: 4,
RpcQueueTime10s99thPercentileLatency: 5,
RpcProcessingTime10sNumOps: 103,
RpcProcessingTime10s50thPercentileLatency: 3,
RpcProcessingTime10s75thPercentileLatency: 6,
RpcProcessingTime10s90thPercentileLatency: 8,
RpcProcessingTime10s95thPercentileLatency: 9,
RpcProcessingTime10s99thPercentileLatency: 10,
DeferredRpcProcessingTime10sNumOps: 0,
DeferredRpcProcessingTime10s50thPercentileLatency: 0,
DeferredRpcProcessingTime10s75thPercentileLatency: 0,
DeferredRpcProcessingTime10s90thPercentileLatency: 0,
DeferredRpcProcessingTime10s95thPercentileLatency: 0,
DeferredRpcProcessingTime10s99thPercentileLatency: 0,
ReceivedBytes: 86356893642,
SentBytes: 1808072466,
RpcQueueTimeNumOps: 41511724,
RpcQueueTimeAvgTime: 1.4071428571428573,
RpcProcessingTimeNumOps: 41511724,
RpcProcessingTimeAvgTime: 3.414285714285714,
... ...
{code}



> Detect and log blocking main thread operations
> ----------------------------------------------
>
>                 Key: FLINK-15924
>                 URL: https://issues.apache.org/jira/browse/FLINK-15924
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Till Rohrmann
>            Priority: Major
>             Fix For: 1.11.0
>
>
> When using the {{RpcEndpoint}} it is important that all operations which run on the main thread are never blocking. We have seen in the past that it is quite hard to always catch blocking operations in reviews and sometimes these changes caused instabilities in Flink. Once this happens it is not trivial to find the culprit which is responsible for the blocking operation.
> One way to make debugging easier is to add a monitor which detects and logs if a {{RpcEndpoint}} operation takes longer than {{n}} seconds for example. Depending on the overhead of this monitor one could even think about enabling it only via a special configuration (e.g. debug mode).
> A proper class to introduce this monitor could be the {{AkkaRpcActor}} which is responsible for executing main thread operations. Whenever we schedule an operation, we could start a timeout which if triggered and the operation has not been completed will log a warning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)