Posted to user@hadoop.apache.org by Rainer Toebbicke <rt...@pclella.cern.ch> on 2015/01/27 17:49:09 UTC

cannot create files in hdfs when -put command issued on a datanode which is in exclude list

Hello,


I ran into a weird problem creating files, and for the moment I have only a shaky conclusion:

Logged in as a vanilla user on a datanode, the simple command "hdfs dfs -put /etc/motd motd" reproducibly bails out with:

WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/tobbicke/motd._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1).  There are 17 datanode(s) running and no node(s) are excluded in this operation.


Restarting datanodes did not help, and the namenode logs were rather inconclusive. Following the only hint in there, authentication problems from other users (we're using Kerberos), I happened to log in to another datanode, and to my surprise (!) everything worked smoothly there.

Trying on all of them with a mix of successes and failures, the only conclusion I came up with is that putting a datanode into "decommissioning" somehow affects client write access (no problem for -get), even for ordinary users.
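
To see which nodes the namenode currently considers decommissioning, the stock report command works (the output labels may differ slightly between Hadoop versions):

# list every datanode together with its decommission state as seen by the namenode
hdfs dfsadmin -report | grep -E 'Name:|Decommission Status'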

Is this possible? Intended, even? And if yes, what is the logic behind it (after all, I don't care which datanodes the file ends up on; there are plenty)?
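
(For context, decommissioning here is done the standard way: the node's hostname goes into the file referenced by the dfs.hosts.exclude property in hdfs-site.xml, and the namenode is told to re-read it. The exact file path is site-specific.)

# after adding the hostname to the exclude file named by dfs.hosts.exclude,
# make the namenode re-read its include/exclude lists
hdfs dfsadmin -refreshNodes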

We're on Cloudera CDH 5.2.0 (Hadoop 2.5.0), in case that matters.

Any ideas?



Re: cannot create files in hdfs when -put command issued on a datanode which is in exclude list

Posted by Rainer Toebbicke <rt...@pclella.cern.ch>.
Hello,

Adding to this: the HBase regionserver does not survive either when it runs into this situation! When putting a node into "decommissioning", if a regionserver has a file open on that node, it dies:


2015-01-28 10:11:18,178 FATAL [regionserver60020.logRoller] regionserver.HRegionServer: ABORTING region server xxxxx.cern.ch,60020,1422371469606: Failed log close in log roller
org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: #1422436277964
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.cleanupCurrentWriter(FSHLog.java:787)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:575)
        at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:97)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/xxxxx.cern.ch,60020,1422371469606/xxxxx.cern.ch%2C60020%2C1422371469606.1422436277964 could only be replicated to 0 nodes instead of minReplication (=1).  There are 17 datanode(s) running and 17 node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:614)
        at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:188)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:476)
....
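
(As a minimal pre-flight check before decommissioning, assuming the stock "Decommission Status : Normal" label in the dfsadmin report output, one can count the datanodes still fully in service and compare that against the configured replication factor:)

# count datanodes that are neither decommissioning nor decommissioned
hdfs dfsadmin -report | grep -c 'Decommission Status : Normal'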


