Posted to common-issues@hadoop.apache.org by "Stephen O'Donnell (Jira)" <ji...@apache.org> on 2020/01/31 18:15:00 UTC

[jira] [Commented] (HADOOP-16833) InstrumentedLock should log lock queue time

    [ https://issues.apache.org/jira/browse/HADOOP-16833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027715#comment-17027715 ] 

Stephen O'Donnell commented on HADOOP-16833:
--------------------------------------------

I have uploaded a patch for this change. The original implementation has a feature where it logs if a thread holds the lock for too long, and it logs at most one such message per throttle interval. The datanode specifies 300ms as the "too long" threshold and 10 seconds as the minimum gap between log messages. It also keeps a count of how many messages it has suppressed and logs that too.
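
To make the throttling concrete, here is a minimal sketch of that behaviour, assuming a simplified standalone class. The class and method names (ThrottledLockLogger, check) are illustrative, not the real InstrumentedLock API.

{code}
// Simplified sketch of the existing throttled "held too long" warning.
// Names are illustrative, not the actual InstrumentedLock implementation.
import java.util.concurrent.atomic.AtomicLong;

public class ThrottledLockLogger {
  private final long warnThresholdMs;   // e.g. 300ms in the datanode
  private final long minLoggingGapMs;   // e.g. 10 seconds between messages
  private final AtomicLong lastLogTimestamp = new AtomicLong(0);
  private final AtomicLong suppressedCount = new AtomicLong(0);

  public ThrottledLockLogger(long warnThresholdMs, long minLoggingGapMs) {
    this.warnThresholdMs = warnThresholdMs;
    this.minLoggingGapMs = minLoggingGapMs;
  }

  /** Warn if elapsedMs exceeds the threshold, at most once per logging gap. */
  public void check(String event, long elapsedMs) {
    if (elapsedMs < warnThresholdMs) {
      return;
    }
    long now = System.currentTimeMillis();
    long last = lastLogTimestamp.get();
    if (now - last < minLoggingGapMs
        || !lastLogTimestamp.compareAndSet(last, now)) {
      // Throttled: count the dropped warning instead of logging it.
      suppressedCount.incrementAndGet();
      return;
    }
    long suppressed = suppressedCount.getAndSet(0);
    System.out.printf("%s for %d ms (%d warnings suppressed)%n",
        event, elapsedMs, suppressed);
  }
}
{code}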

I extended this so there are now two types of log messages: one for waiting on the lock and one for holding the lock. Each has its own suppression count and "last log time", so the two types of messages are throttled independently.

I decided to use the same thresholds for both the wait and hold logging to keep it simple.
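
As a rough illustration of the two independently throttled message types sharing the same thresholds, here is a hypothetical wrapper (again a sketch under assumed names, building on the ThrottledLockLogger sketch above, not the patch itself):

{code}
// Hypothetical wrapper: wait time and hold time measured separately,
// each with its own throttle state but the same thresholds (300ms / 10s).
import java.util.concurrent.locks.ReentrantLock;

public class WaitTimedLock {
  private final ReentrantLock lock = new ReentrantLock();
  private final ThrottledLockLogger waitLogger =
      new ThrottledLockLogger(300, 10_000);
  private final ThrottledLockLogger holdLogger =
      new ThrottledLockLogger(300, 10_000);
  private long acquiredAtMs; // written and read only while holding the lock

  public void lock() {
    long start = System.currentTimeMillis();
    lock.lock();
    acquiredAtMs = System.currentTimeMillis();
    // Queue time: how long this thread waited to acquire the lock.
    waitLogger.check("Lock wait", acquiredAtMs - start);
  }

  public void unlock() {
    long heldMs = System.currentTimeMillis() - acquiredAtMs;
    lock.unlock();
    // Hold time: how long this thread kept the lock.
    holdLogger.check("Lock held", heldMs);
  }
}
{code}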

One idea I had was that if the logger passed to InstrumentedLock has debug or trace enabled, we could ignore the throttling and log every time, but that could make the logs very noisy, so perhaps it is not a good idea after all.

> InstrumentedLock should log lock queue time
> -------------------------------------------
>
>                 Key: HADOOP-16833
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16833
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 3.3.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>         Attachments: HADOOP-16833.001.patch
>
>
> Within the Datanode, we often see contention around the FsDatasetImpl lock. This can happen for various reasons, e.g. the DN is under IO load, or due to other bugs, e.g. HDFS-15131.
> When DN slowdowns happen, it is very difficult to debug the cause, as there are few messages in the logs which indicate what is happening.
> In my experience, the presence of this log is informative:
> {code}
> 2019-11-27 09:00:49,678 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Took 21540ms to process 1 commands from NN
> {code}
> This comes from the datanode heartbeat thread. It is logged when the commands returned from the heartbeat cannot be enqueued, usually because the thread cannot get the FsDatasetImpl lock due to contention with other slow threads.
> HDFS-14997 moved the command processing to an async thread, and hence this useful message will disappear.
> InstrumentedLock introduced a feature to the datanode, where it will log a stack trace if any thread holds the lock for over 300ms. However, this will not catch a scenario where 10 threads each hold the lock for 200ms, leaving the next thread in the queue waiting over 2 seconds.
> I believe it would be useful to extend InstrumentedLock to log if a thread has to wait longer than some threshold to acquire the lock. That way, we would be able to catch scenarios like the one the heartbeat thread shows us, but in a much more obvious way, provided lock contention is involved.


