Posted to common-issues@hadoop.apache.org by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org> on 2011/04/12 12:38:06 UTC

[jira] [Created] (HADOOP-7221) When Namenode network is unplugged, DFSClient operations waits for ever

When Namenode network is unplugged, DFSClient operations waits for ever
-----------------------------------------------------------------------

                 Key: HADOOP-7221
                 URL: https://issues.apache.org/jira/browse/HADOOP-7221
             Project: Hadoop Common
          Issue Type: Bug
          Components: ipc
            Reporter: Uma Maheswara Rao G


When the NN/DN is shut down gracefully, DFSClient operations that are waiting for a response from the NN/DN throw an exception and return quickly.

But when the NN/DN network is unplugged, DFSClient operations that are waiting for a response from the NN/DN wait forever.


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (HADOOP-7488) When Namenode network is unplugged, DFSClient operations waits for ever

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uma Maheswara Rao G resolved HADOOP-7488.
-----------------------------------------

    Resolution: Duplicate
    
> When Namenode network is unplugged, DFSClient operations waits for ever
> -----------------------------------------------------------------------
>
>                 Key: HADOOP-7488
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7488
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HADOOP-7488.patch
>
>
> When the NN/DN is shut down gracefully, DFSClient operations that are waiting for a response from the NN/DN throw an exception and return quickly.
> But when the NN/DN network is unplugged, DFSClient operations that are waiting for a response from the NN/DN wait forever.


[jira] [Commented] (HADOOP-7488) When Namenode network is unplugged, DFSClient operations waits for ever

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080491#comment-13080491 ] 

Konstantin Shvachko commented on HADOOP-7488:
---------------------------------------------

If {{rpcTimeout > 0}}, then {{handleTimeout()}} will throw {{SocketTimeoutException}} instead of going into the ping loop. Can you control the required behavior by setting {{rpcTimeout > 0}} rather than introducing a limit on the number of pings?
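
The decision described above can be modeled in isolation. The following is a simplified, self-contained sketch and not the actual Hadoop IPC client code: the class name TimeoutPolicySketch and the pingsSent counter are hypothetical, and the real handleTimeout() may check additional conditions.

```java
import java.net.SocketTimeoutException;

/** Simplified model of the timeout decision (illustration only, not Hadoop source). */
final class TimeoutPolicySketch {
    private final int rpcTimeout;   // milliseconds; 0 means no RPC timeout is set
    private int pingsSent = 0;      // hypothetical counter, used only to observe behavior

    TimeoutPolicySketch(int rpcTimeout) {
        this.rpcTimeout = rpcTimeout;
    }

    /** Called when a read on the connection times out. */
    void handleTimeout(SocketTimeoutException e) throws SocketTimeoutException {
        if (rpcTimeout > 0) {
            throw e;        // surface the timeout to the caller
        }
        pingsSent++;        // otherwise stay in the ping loop
    }

    int pingsSent() {
        return pingsSent;
    }
}
```

With rpcTimeout set to 0 every read timeout only produces another ping, which is why a caller can block for as long as the timeouts continue.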

DataNodes and TaskTrackers are designed to ping NN and JT infinitely, because during startup you cannot predict when NN will come online as it depends on the size of the image and edits. Also when NN becomes busy it is important for DNs to keep retrying rather than assuming the NN is dead.

For DFSClient this may make sense, but I think they already time out. At least DFSShell ls does. And even if they don't, this should be an HDFS change, not a generic IPC change, which affects many Hadoop components.

As for HA, I don't know what you did for HA and therefore cannot understand what problem you are trying to solve here. I can guess that you want DNs to switch to another NN when they time out rather than retrying. In this case you should be able to use rpcTimeout.


[jira] [Commented] (HADOOP-7488) When Namenode network is unplugged, DFSClient operations waits for ever

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100366#comment-13100366 ] 

Uma Maheswara Rao G commented on HADOOP-7488:
---------------------------------------------

Hi Konstantin,

  I want your opinion on this. Can you have a look?

Thanks
Uma


[jira] [Commented] (HADOOP-7221) When Namenode network is unplugged, DFSClient operations waits for ever

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028030#comment-13028030 ] 

ramkrishna.s.vasudevan commented on HADOOP-7221:
------------------------------------------------

The problem was that whenever the network was unplugged, the read operation got a timeout exception and retried. This repeated about 15 times, and only then did a connection-loss exception arrive and let the operation exit.
By that time around 45 minutes had passed.
Hence we added a parameter, "max.ping.retries.on.socket.timeout", which sets the number of socket timeouts after which the operation should give up. While retrying, the code checks this configured value and exits once it is reached.

This problem occurs only in unplug scenarios, so the value can be tuned to the scenario to control how long the client keeps trying to get a connection.
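
The retry limit described above could look roughly like the sketch below. It is not the attached patch: the class and method names are hypothetical, and the rule that the read gives up once the number of timeouts reaches the limit is an assumption.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

/** Illustration of a bounded retry on socket timeouts (hypothetical names). */
final class BoundedPingRetrySketch {
    /**
     * Runs the read and retries it whenever it times out.
     * Gives up by rethrowing once maxRetries timeouts have been seen.
     */
    static <T> T readWithRetries(Callable<T> read, int maxRetries) throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        int timeouts = 0;
        while (true) {
            try {
                return read.call();
            } catch (SocketTimeoutException e) {
                timeouts++;
                if (timeouts >= maxRetries) {
                    throw e;    // limit reached: stop waiting and report the timeout
                }
            }
        }
    }
}
```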


[jira] [Assigned] (HADOOP-7488) When Namenode network is unplugged, DFSClient operations waits for ever

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uma Maheswara Rao G reassigned HADOOP-7488:
-------------------------------------------

    Assignee: Uma Maheswara Rao G


[jira] [Commented] (HADOOP-7488) When Namenode network is unplugged, DFSClient operations waits for ever

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072350#comment-13072350 ] 

Uma Maheswara Rao G commented on HADOOP-7488:
---------------------------------------------

Thanks, Jhon, for taking a look at this issue.
I have updated the patch for review.

This patch introduces a property (max.ping.retries.on.socket.timeout). Its default value is -1, which means the limit is disabled.

In this scenario, if we unplug the network cable between the nodes, the ping reads time out continuously. The socket timeouts were handled by retrying indefinitely, so the client waited for a very long time.

To avoid this problem, we can now configure the number of ping retries.
Continuous timeouts mean that something is wrong in the network or cluster, so we can limit the retries by configuring the above property.

By default this property is disabled.
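
As a rough illustration of the "-1 means disabled" convention, the sketch below reads the property and decides when to stop. It uses java.util.Properties as a stand-in for the Hadoop Configuration class, and the class name, method names, and the exact comparison are assumptions rather than the contents of the patch.

```java
import java.util.Properties;

/** Illustration of reading the retry limit with a disabled default (hypothetical names). */
final class PingRetryConfigSketch {
    static final String KEY = "max.ping.retries.on.socket.timeout";
    static final int DISABLED = -1;

    /** Returns the configured limit, or -1 when the property is not set. */
    static int maxPingRetries(Properties conf) {
        String raw = conf.getProperty(KEY);
        return raw == null ? DISABLED : Integer.parseInt(raw.trim());
    }

    /** A negative limit never gives up, which preserves the existing behavior. */
    static boolean shouldGiveUp(int maxRetries, int timeoutsSoFar) {
        return maxRetries >= 0 && timeoutsSoFar >= maxRetries;
    }
}
```

Keeping the default at -1 means that components which do not set the property continue to retry as before.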


--Thanks


[jira] [Moved] (HADOOP-7488) When Namenode network is unplugged, DFSClient operations waits for ever

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uma Maheswara Rao G moved HDFS-1880 to HADOOP-7488:
---------------------------------------------------

    Component/s:     (was: hdfs client)
                 ipc
            Key: HADOOP-7488  (was: HDFS-1880)
        Project: Hadoop Common  (was: Hadoop HDFS)


[jira] [Updated] (HADOOP-7488) When Namenode network is unplugged, DFSClient operations waits for ever

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uma Maheswara Rao G updated HADOOP-7488:
----------------------------------------

    Status: Open  (was: Patch Available)


[jira] [Commented] (HADOOP-7488) When Namenode network is unplugged, DFSClient operations waits for ever

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076044#comment-13076044 ] 

Uma Maheswara Rao G commented on HADOOP-7488:
---------------------------------------------


Updated the patch for review.


[jira] [Commented] (HADOOP-7221) When Namenode network is unplugged, DFSClient operations waits for ever

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028262#comment-13028262 ] 

Steve Loughran commented on HADOOP-7221:
----------------------------------------

# Which version are you seeing this on?
# When you say unplugged, do you mean the Ethernet port of your local machine came unplugged, or the connection to the remote server failed?
# Can you add the stack trace you see in the exceptions, to show where the problem is?

This is an HDFS problem, so I am re-assigning it there.


> When Namenode network is unplugged, DFSClient operations waits for ever
> -----------------------------------------------------------------------
>
>                 Key: HADOOP-7221
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7221
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: hdfs client
>            Reporter: Uma Maheswara Rao G
>
> When the NN/DN is shut down gracefully, DFSClient operations that are waiting for a response from the NN/DN throw an exception and return quickly.
> But when the NN/DN network is unplugged, DFSClient operations that are waiting for a response from the NN/DN wait forever.


[jira] [Commented] (HADOOP-7488) When Namenode network is unplugged, DFSClient operations waits for ever

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085760#comment-13085760 ] 

Uma Maheswara Rao G commented on HADOOP-7488:
---------------------------------------------

Hi Konstantin,

Thanks a lot for taking a look at this issue.


{quote}
If rpcTimeout > 0 then {{handleTimeout()}} will throw SocketTimeoutException instead of going into the ping loop. Can you control the required behavior by setting rpcTimeout > 0 rather than introducing a limit on the number of pings?
{quote}
Yes, we can control it with this parameter as well.

I am planning to add the code below in DataNode when getting the proxy.

{code}
// get NN proxy
DatanodeProtocol dnp =
    (DatanodeProtocol) RPC.waitForProxy(DatanodeProtocol.class,
        DatanodeProtocol.versionID, nnAddr, conf, socketTimeout,
        Long.MAX_VALUE);
{code}

Here socketTimeout is passed as the rpcTimeout. This property is already used as the rpcTimeout for createInterDataNodeProtocolProxy:
 this.socketTimeout = conf.getInt(DFS_CLIENT_SOCKET_TIMEOUT_KEY,
                                  HdfsConstants.READ_TIMEOUT);

But my question is: if I use socketTimeout (default 60*1000 ms) as the rpcTimeout, the default behavior will change. I don't want to change the default behavior here.
Any suggestions?

{quote}
DataNodes and TaskTrackers are designed to ping NN and JT infinitely, because during startup you cannot predict when NN will come online as it depends on the size of the image and edits. Also when NN becomes busy it is important for DNs to keep retrying rather than assuming the NN is dead.
{quote}

Yes. But in some scenarios, such as a network unplug, reads may throw timeouts, and because of the timeout handling the system is blocked unnecessarily for a long time.
As far as I know, even if we throw that timeout exception out to the JT or DN, they will handle it and retry in their offerService methods, except in the condition below:
{code}
catch (RemoteException re) {
  String reClass = re.getClassName();
  if (UnregisteredNodeException.class.getName().equals(reClass) ||
      DisallowedDatanodeException.class.getName().equals(reClass) ||
      IncorrectVersionException.class.getName().equals(reClass)) {
    LOG.warn("blockpool " + blockPoolId + " is shutting down", re);
    shouldServiceRun = false;
    return;
  }
{code}


{quote}
And even if they don't, this should be an HDFS change, not a generic IPC change, which affects many Hadoop components.
{quote}
  
My feeling is that this particular issue applies to all components that use Hadoop IPC. I also planned to retain the default behavior so as not to affect the other components; a user who really needs the limit can tune the configuration parameter to their requirements.

Anyway, since we have decided to use rpcTimeout, only the IPC user code needs to pass this value. In that case this becomes an HDFS-specific change. We also need to check MapReduce (the same situation applies to the JT).


{quote}
As for HA, I don't know what you did for HA and therefore cannot understand what problem you are trying to solve here. I can guess that you want DNs to switch to another NN when they time out rather than retrying. In this case you should be able to use rpcTimeout.
{quote}
Yes, your guess is correct :-)
In our HA solution we use *BackupNode*, and the switching framework is *ZooKeeper-based* LeaderElection. DNs are configured with both the active and standby node addresses. On any failure, DNs try to switch to the other NN.
The scenario here is that we unplugged the active NN's network card, and all DNs were then blocked for a long time.
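
The switching behavior described above can be sketched as follows. This is only an illustration under assumptions: the NameNodeCall interface, the method names, and the example addresses are hypothetical, and the sketch ignores the ZooKeeper-based election entirely. It shows why a timeout has to reach the caller before a switch to the other NN can happen.

```java
import java.net.SocketTimeoutException;
import java.util.List;

/** Illustration of switching to the other NameNode on timeout (hypothetical names). */
final class NameNodeFailoverSketch {
    /** Hypothetical stand-in for an RPC call to one NameNode address. */
    interface NameNodeCall<T> {
        T invoke(String address) throws SocketTimeoutException;
    }

    /**
     * Tries each configured address in order and returns the first response.
     * A timeout moves on to the next address instead of retrying the same one.
     */
    static <T> T callWithFailover(List<String> addresses, NameNodeCall<T> call)
            throws SocketTimeoutException {
        SocketTimeoutException last = null;
        for (String address : addresses) {
            try {
                return call.invoke(address);
            } catch (SocketTimeoutException e) {
                last = e;   // remember the failure and try the other NameNode
            }
        }
        if (last == null) {
            throw new IllegalArgumentException("no NameNode addresses configured");
        }
        throw last;         // every configured address timed out
    }
}
```

If the underlying call never throws and keeps pinging instead, the loop above never advances to the standby address, which matches the blocked DNs in the unplug scenario.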


--Thanks


[jira] [Commented] (HADOOP-7488) When Namenode network is unplugged, DFSClient operations waits for ever

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439844#comment-13439844 ] 

Uma Maheswara Rao G commented on HADOOP-7488:
---------------------------------------------

We can make use of HADOOP-6889. Marking this as a duplicate of it.
                

[jira] [Updated] (HADOOP-7488) When Namenode network is unplugged, DFSClient operations waits for ever

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uma Maheswara Rao G updated HADOOP-7488:
----------------------------------------

    Attachment: HADOOP-7488.patch


[jira] [Updated] (HADOOP-7488) When Namenode network is unplugged, DFSClient operations waits for ever

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uma Maheswara Rao G updated HADOOP-7488:
----------------------------------------

    Status: Patch Available  (was: Open)
