You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-issues@hadoop.apache.org by "Stephen O'Donnell (Jira)" <ji...@apache.org> on 2020/02/03 13:09:00 UTC

[jira] [Comment Edited] (HADOOP-16833) InstrumentedLock should log lock queue time

    [ https://issues.apache.org/jira/browse/HADOOP-16833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028920#comment-17028920 ] 

Stephen O'Donnell edited comment on HADOOP-16833 at 2/3/20 1:08 PM:
--------------------------------------------------------------------

I uploaded a v3 patch. This ensure the instrumented lock will track the maximum wait time it suppressed.

The way the DN uses this lock, it sets a suppression interval of 10 seconds, and a log threshold of 300ms. 

The the first thread which exceeds the threshold and there has been at least 10 seconds since the last log message will get logged. But its possible for another thread to have waited longer and the suppressed message does not give any clues about the longest wait time, and it would be nice to see that.

Eg

{code}

-------------> | Log Message --------------| Log Message (triggered by "short wait") ---->

| Waiting ---------------------------> | suppressed
                           | short Wait -> | (this is triggers another log, masking the long wait)
{code}

I am not sure how clear that diagram makes it, but basically a tread could have been waiting for some time and the short waiter gets logged. With this change we should also get the longest waiting time during that interval.


was (Author: sodonnell):
I uploaded a v3 patch. This ensure the instrumented lock will track the maximum wait time it suppressed.

The way the DN uses this lock, it sets a suppression interval of 10 seconds, and a log threshold of 300ms. 

The the first thread which exceeds the threshold and there has been at least 10 seconds since the last log message will get logged. But its possible for another thread to have waited longer and the suppressed message does not give any clues about the longest wait time, and it would be nice to see that.

Eg

{code}

-------------> | Log Message --------------| Log Message (triggered by "short wait") ---->

| Waiting -----------------------------> | suppressed
                                                | short Wait -> | (this is triggers another log, masking the long wait)

{code}

I am not sure how clear that diagram makes it, but basically a tread could have been waiting for some time and the short waiter gets logged. With this change we should also get the longest waiting time during that interval.

> InstrumentedLock should log lock queue time
> -------------------------------------------
>
>                 Key: HADOOP-16833
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16833
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 3.3.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>         Attachments: HADOOP-16833.001.patch, HADOOP-16833.002.patch, HADOOP-16833.003.patch
>
>
> Within the Datanode, we often see contention around the FsDatasetImpl lock. This can be for various reasons, eg the DN is under IO load, and other bugs, eg HDFS-15131.
> When DN slow downs happen, it is very difficult to debug what is causing it, as there are few messages in the logs which indicate what is happening.
> In my experience, the presence of this log is informative:
> {code}
> 2019-11-27 09:00:49,678 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Took 21540ms to process 1 commands from NN
> {code}
> This comes from the datanode heartbeat thread, and it is logged when the commands returned from the heartbeat cannot be enqueued, and usually this is because the thread cannot get the FsDatasetImpl lock due to contention with other slow threads.
> HDFS-14997 moved the command processing to an async thread, and hence this useful message will disappear.
> InstrumentedLock introduced a feature to the datanode, where it will log a stack trace if any thread holds the lock for over 300ms. However this will not catch a scenario where 10 threads each hold a lock for 200ms, leading to the next in the queue having waited over 2 seconds.
> I believe it would be useful to extend InstrumentedLock to log if a thread has to wait for over some threshold. That way, we could be able to catch scenarios like the heartbeat thread shows us, but in a much more obvious way, provided lock contention is involved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org