You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-issues@hadoop.apache.org by "Xudong Cao (Jira)" <ji...@apache.org> on 2019/10/31 07:52:00 UTC

[jira] [Updated] (HADOOP-16677) The remaining timeout millis printed into the log also need to be recalculated when thread is interrupted while blocking in select().

     [ https://issues.apache.org/jira/browse/HADOOP-16677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xudong Cao updated HADOOP-16677:
--------------------------------
    Summary: The remaining timeout millis printed into the log also need to be recalculated when thread is interrupted while blocking in select().  (was: The remaining millis timeout printed into the log also need to be recalculated when thread is interrupted while blocking in select().)

> The remaining timeout millis printed into the log also need to be recalculated when thread is interrupted while blocking in select().
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-16677
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16677
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>    Affects Versions: 3.1.3
>            Reporter: Xudong Cao
>            Assignee: Xudong Cao
>            Priority: Minor
>
> In SocketIOWithTimeout, when a thread was interrupted and exit from select(), it proceed to throw an InterruptedIOException, in exception message the remaining timeout mills should be recalcuated correctly rather than simply give a total timeout millis,  otherwise it could be very misleading.
> For example, if an hdfs writer has not sent any packet to the pipeline more than 60s (e.g. full gc or network issues),  then one of pipeline datanodes may be timeout and close its sockets to other dns,  so its upstream DN's PacketResponder will immediately meet an EOF and then interrupt its own DataXceiver, finally its DataXeiver will print some logs like:
>  
> {code:java}
> 2019-10-24 09:22:58,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-753871533-10.215.131.216-1511957392115:blk_10646544613_95736750382019-10-24 09:22:58,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-753871533-10.215.131.216-1511957392115:blk_10646544613_9573675038java.io.InterruptedIOException: Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/10.196.146.114:9003 remote=/10.215.153.105:38559]. 60000 millis timeout left. at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:342) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157) ...
>  
> {code}
>  
> This log is very misleading because a 60000 mills timeout left implies that the DataXceiver never blocks in select(), and this is unrealistic.  in fact, the truly timeout mills left should be: 60000 - timeElapsedWhenSelect.
> Finally, a properly log should be like this:
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org