Posted to common-dev@hadoop.apache.org by "Raghu Angadi (JIRA)" <ji...@apache.org> on 2009/01/29 00:18:59 UTC

[jira] Issue Comment Edited: (HADOOP-4672) RPC on Datanode blocked forever.

    [ https://issues.apache.org/jira/browse/HADOOP-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668220#action_12668220 ] 

rangadi edited comment on HADOOP-4672 at 1/28/09 3:18 PM:
---------------------------------------------------------------

I suspect this problem will be very hard to reproduce. Even if we find the cause, we most likely won't be able to do much about it. It is probably some bad interaction between epoll, the kernel, and the JDK.

Fortunately, all the stuck threads are reading from or writing to sockets that don't have a timeout. So one workaround is to add a timeout (something like 10 minutes).

Currently the following need a timeout:
    - the upstream socket in the datanode write pipeline
    - IPC client writes to the server (reads already have a timeout and are controlled by pings)
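
As a rough illustration of the workaround (not the actual Hadoop patch), a write to a non-blocking channel can be bounded by waiting on a Selector with a deadline instead of blocking indefinitely. The class name TimedChannelWriter, the method writeFully, and the constant DEFAULT_TIMEOUT_MS are hypothetical names made up for this sketch; the 10-minute value only mirrors the "something like 10 minutes" suggestion above.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.WritableByteChannel;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not part of Hadoop.
final class TimedChannelWriter {
    // Assumed default, following the "something like 10 minutes" suggestion.
    static final long DEFAULT_TIMEOUT_MS = 10L * 60 * 1000;

    // Writes all of buf to a non-blocking channel. Fails with
    // SocketTimeoutException if no bytes can be written for timeoutMs.
    static <C extends SelectableChannel & WritableByteChannel> void writeFully(
            C ch, ByteBuffer buf, long timeoutMs) throws IOException {
        if (ch.isBlocking()) {
            throw new IllegalArgumentException("channel must be non-blocking");
        }
        if (timeoutMs <= 0) {
            throw new IllegalArgumentException("timeoutMs must be positive");
        }
        long timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        try (Selector selector = Selector.open()) {
            ch.register(selector, SelectionKey.OP_WRITE);
            long deadline = System.nanoTime() + timeoutNanos;
            while (buf.hasRemaining()) {
                if (ch.write(buf) > 0) {
                    // Progress was made, so restart the idle timer.
                    deadline = System.nanoTime() + timeoutNanos;
                    continue;
                }
                long remainingMs =
                        TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMs <= 0) {
                    throw new SocketTimeoutException(
                            "no write progress for " + timeoutMs + " ms");
                }
                // Bounded wait; select(0) would block forever, so it is never used.
                selector.select(remainingMs);
                selector.selectedKeys().clear();
            }
        }
    }
}
```

A SocketChannel that has been switched to non-blocking mode with configureBlocking(false) can be passed as ch. For reads on a plain blocking java.net.Socket, Socket.setSoTimeout(int) gives a comparable bound, but it does not apply to writes, which is why the write side needs something like the selector-based wait above.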


      was (Author: rangadi):
    I suspect it is going to be very hard to reproduce this problem. Even after finding the problem, mostly, we don't be able to do much about it.

Fortunately, all the stuck threads are reading or writing from sockets that don't have a timeout. So one work around is to have a timeout (something like 10 minutes). 

Currently following need timeout :
    - upstream socket in datanode write pipeline
    - IPC client writes to the server (reads already have a timeout and controlled by pings)

  
> RPC on Datanode blocked forever.
> --------------------------------
>
>                 Key: HADOOP-4672
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4672
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs, io
>    Affects Versions: 0.17.0
>         Environment: Java SE 1.6.0-b105 on Linux 2.6.x
>            Reporter: Raghu Angadi
>
> We recently noticed that a number of datanodes got stuck. The main thread that sends heartbeats and block reports is blocked in select() inside the blockReport() RPC.  I will add a stack trace in the next comment.
> I am not sure why select was blocked forever since there is no connection open to NameNode. In fact, NN was restarted in between. It could be some JDK bug or a Hadoop bug.
