Posted to user@hbase.apache.org by Eran Kutner <er...@gigya.com> on 2011/04/10 17:07:34 UTC

Cluster crash

Hi,
While doing load testing on HBase, the entire cluster crashed with
errors like these in the HBase logs:

2011-04-10 10:14:30,844 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_1213779416283711358_54194 bad datanode[0]
10.1.104.1:50010
2011-04-10 10:14:30,844 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_1213779416283711358_54194 in pipeline
10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010: bad datanode
10.1.104.1:50010
2011-04-10 10:14:30,880 WARN org.apache.hadoop.hdfs.DFSClient: Failed
recovery attempt #2 from primary datanode 10.1.104.2:50010
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:2099)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.commitBlockSynchronization(NameNode.java:703)
        at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410)

        at org.apache.hadoop.ipc.Client.call(Client.java:1104)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
        at $Proxy4.commitBlockSynchronization(Unknown Source)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1847)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1828)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1924)
        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410)

        at org.apache.hadoop.ipc.Client.call(Client.java:1104)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
        at $Proxy8.recoverBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2785)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2289)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2461)
2011-04-10 10:14:30,880 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_1213779416283711358_54194 failed  because
recovery from primary datanode 10.1.104.2:50010 failed 3 times.
Pipeline was 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010.
Will retry...
2011-04-10 10:14:31,882 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_1213779416283711358_54194 bad datanode[0]
10.1.104.1:50010
2011-04-10 10:14:31,882 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_1213779416283711358_54194 in pipeline
10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010: bad datanode
10.1.104.1:50010
2011-04-10 10:14:31,905 WARN org.apache.hadoop.hdfs.DFSClient: Failed
recovery attempt #3 from primary datanode 10.1.104.2:50010
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:2099)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.commitBlockSynchronization(NameNode.java:703)
        at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410)

        at org.apache.hadoop.ipc.Client.call(Client.java:1104)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
        at $Proxy4.commitBlockSynchronization(Unknown Source)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1847)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1828)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1924)
        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410)

        at org.apache.hadoop.ipc.Client.call(Client.java:1104)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
        at $Proxy8.recoverBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2785)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2289)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2461)
2011-04-10 10:14:31,905 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_1213779416283711358_54194 failed  because
recovery from primary datanode 10.1.104.2:50010 failed 4 times.
Pipeline was 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010.
Will retry...

The Datanode log for this block ID shows:
# cat  /var/log/hadoop/hadoop-hadoop-datanode-hadoop1-s02.log | grep
1213779416283711358
2011-04-10 07:29:23,858 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_1213779416283711358_54194 src: /10.1.104.5:40533 dest:
/10.1.104.2:50010
2011-04-10 10:12:58,242 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
receiveBlock for block blk_1213779416283711358_54194
java.io.EOFException: while trying to read 871 bytes
2011-04-10 10:12:58,242 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
block blk_1213779416283711358_54194 Interrupted.
2011-04-10 10:12:58,242 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
block blk_1213779416283711358_54194 terminating
2011-04-10 10:12:58,242 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_1213779416283711358_54194 received exception java.io.EOFException:
while trying to read 871 bytes
2011-04-10 10:12:58,288 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode:
oldblock=blk_1213779416283711358_54194(length=162696),
newblock=blk_1213779416283711358_54249(length=162696),
datanode=10.1.104.2:50010
2011-04-10 10:12:58,288 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
blk_1213779416283711358_54249 of size 162696 as part of lease
recovery.
2011-04-10 10:12:58,766 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode:
oldblock=blk_1213779416283711358_54249(length=162696),
newblock=blk_1213779416283711358_54250(length=162696),
datanode=10.1.104.2:50010
2011-04-10 10:12:58,803 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_1213779416283711358_54250 src: /10.1.104.5:33159 dest:
/10.1.104.2:50010
2011-04-10 10:12:58,805 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Reopen Block for
append blk_1213779416283711358_54250
2011-04-10 10:12:58,827 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: setBlockPosition
trying to set position to 162696 for block
blk_1213779416283711358_54250 which is not a multiple of
bytesPerChecksum 512
2011-04-10 10:12:58,827 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode:
computePartialChunkCrc sizePartialChunk 392 block
blk_1213779416283711358_54250 offset in block 162304 offset in
metafile 1275
2011-04-10 10:12:58,827 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Read in partial CRC
chunk from disk for block blk_1213779416283711358_54250
2011-04-10 10:12:58,828 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/10.1.104.5:33159, dest: /10.1.104.2:50010, bytes: 162696, op:
HDFS_WRITE, cliID:
DFSClient_hb_m_hadoop1-hbm1.farm-ny.gigya.com:60000_1301599306867,
offset: 0, srvID: DS-1554836204-10.1.104.2-50010-1298986316425,
blockid: blk_1213779416283711358_54250, duration: 21639000
2011-04-10 10:12:58,828 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
block blk_1213779416283711358_54250 terminating
2011-04-10 10:13:31,080 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
blk_1213779416283711358_54250 file
/hadoop/dfs/data/current/subdir61/subdir17/blk_1213779416283711358 for
deletion
2011-04-10 10:13:31,250 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
blk_1213779416283711358_54250 at file
/hadoop/dfs/data/current/subdir61/subdir17/blk_1213779416283711358
2011-04-10 10:14:29,811 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
recoverBlock(block=blk_1213779416283711358_54194,
targets=[10.1.104.5:50010, 10.1.104.2:50010])
2011-04-10 10:14:29,814 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
found for block blk_1213779416283711358_54194 on datanode
10.1.104.5:50010
2011-04-10 10:14:29,814 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
found for block blk_1213779416283711358_54194 on datanode
10.1.104.2:50010
2011-04-10 10:14:29,815 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 50020, call recoverBlock(blk_1213779416283711358_54194,
false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@c754900) from
10.1.104.1:34481: error: org.apache.hadoop.ipc.RemoteException:
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
2011-04-10 10:14:29,828 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
recoverBlock(block=blk_1213779416283711358_54194,
targets=[10.1.104.5:50010, 10.1.104.2:50010])
2011-04-10 10:14:29,830 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
found for block blk_1213779416283711358_54194 on datanode
10.1.104.5:50010
2011-04-10 10:14:29,830 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
found for block blk_1213779416283711358_54194 on datanode
10.1.104.2:50010
2011-04-10 10:14:29,831 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 50020, call recoverBlock(blk_1213779416283711358_54194,
false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@516485d9) from
10.1.104.1:34489: error: org.apache.hadoop.ipc.RemoteException:
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
2011-04-10 10:14:30,854 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
recoverBlock(block=blk_1213779416283711358_54194,
targets=[10.1.104.5:50010, 10.1.104.2:50010])
2011-04-10 10:14:30,856 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
found for block blk_1213779416283711358_54194 on datanode
10.1.104.5:50010
2011-04-10 10:14:30,856 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
found for block blk_1213779416283711358_54194 on datanode
10.1.104.2:50010
2011-04-10 10:14:30,857 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 50020, call recoverBlock(blk_1213779416283711358_54194,
false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@62ee365a) from
10.1.104.1:34499: error: org.apache.hadoop.ipc.RemoteException:
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
2011-04-10 10:14:31,879 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
recoverBlock(block=blk_1213779416283711358_54194,
targets=[10.1.104.5:50010, 10.1.104.2:50010])
2011-04-10 10:14:31,882 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
found for block blk_1213779416283711358_54194 on datanode
10.1.104.5:50010
2011-04-10 10:14:31,882 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
found for block blk_1213779416283711358_54194 on datanode
10.1.104.2:50010
2011-04-10 10:14:31,883 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 50020, call recoverBlock(blk_1213779416283711358_54194,
false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@ed5547f) from
10.1.104.1:34500: error: org.apache.hadoop.ipc.RemoteException:
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
2011-04-10 10:14:32,905 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
recoverBlock(block=blk_1213779416283711358_54194,
targets=[10.1.104.5:50010, 10.1.104.2:50010])
2011-04-10 10:14:32,907 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
found for block blk_1213779416283711358_54194 on datanode
10.1.104.5:50010
2011-04-10 10:14:32,907 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
found for block blk_1213779416283711358_54194 on datanode
10.1.104.2:50010
2011-04-10 10:14:32,908 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 50020, call recoverBlock(blk_1213779416283711358_54194,
false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@2dfd9614) from
10.1.104.1:34511: error: org.apache.hadoop.ipc.RemoteException:
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
2011-04-10 10:14:33,935 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
recoverBlock(block=blk_1213779416283711358_54194,
targets=[10.1.104.5:50010, 10.1.104.2:50010])
2011-04-10 10:14:33,937 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
found for block blk_1213779416283711358_54194 on datanode
10.1.104.5:50010
2011-04-10 10:14:33,937 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
found for block blk_1213779416283711358_54194 on datanode
10.1.104.2:50010
2011-04-10 10:14:33,938 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 50020, call recoverBlock(blk_1213779416283711358_54194,
false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@22fe47f3) from
10.1.104.1:34522: error: org.apache.hadoop.ipc.RemoteException:
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found



Running "hadoop fsck /" says the filesystem is healthy.
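One caveat worth checking (this is an assumption about fsck behavior on
this Hadoop vintage, so verify against your version): by default fsck
only inspects closed files, so blocks belonging to files that are still
open for write, such as HBase WALs in mid-recovery, may not be reported
at all. Including them and printing per-block detail would look roughly
like:

```shell
# -openforwrite includes files still open for write (skipped by default);
# -files -blocks -locations prints per-block detail, so a specific block
# ID can be grepped for directly.
hadoop fsck / -openforwrite -files -blocks -locations | grep blk_1213779416283711358
```

If the block only shows up under an open file, that would explain why a
plain "hadoop fsck /" reports the filesystem as healthy.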

The same kind of errors occurred on all the region servers, but for
different blocks and against different datanodes; eventually the entire
cluster shut itself down. This is the second time I've hit this problem
while running load tests on the system. The previous time, once it
started happening it would recur every hour, even with no load on the
cluster, presumably during compaction. I wasn't sure whether it was
somehow my fault, so I truncated the entire table, which stopped the
crashes, but now it's happening again.

Any idea what's going on?

-eran

Re: Cluster crash

Posted by Stack <st...@duboce.net>.
What versions of the software?  I see you searched for the block in the
DN log. What does it say if you search for it in the NN logs?
Thanks,
St.Ack
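
Searching the NameNode log for the same block ID might look like this
(the log path is a guess extrapolated from the datanode log path quoted
above; adjust it for your installation):

```shell
# Path pattern assumed from the datanode log naming earlier in this
# thread (/var/log/hadoop/hadoop-hadoop-datanode-*.log).
grep '1213779416283711358' /var/log/hadoop/hadoop-hadoop-namenode-*.log
```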

On Sun, Apr 10, 2011 at 8:07 AM, Eran Kutner <er...@gigya.com> wrote:
> [quoted original message trimmed]
> 10.1.104.2:50010
> 2011-04-10 10:14:33,938 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 2 on 50020, call recoverBlock(blk_1213779416283711358_54194,
> false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@22fe47f3) from
> 10.1.104.1:34522: error: org.apache.hadoop.ipc.RemoteException:
> java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
> (=blk_1213779416283711358_54194) not found
>
>
>
> Running "hadoop fsck /" says the filesystem is healthy.
>
> The same kind of errors happened on all the region servers but for
> different blocks and against different datanodes, eventually the entire
> cluster shut itself down. This is the second time I'm having this
> problem while running load tests on the system. Previous time, once it
> started happening it would happen every hour or even without any load
> being run on the cluster, presumably during compaction. I wasn't sure
> if it was somehow my fault, so I truncated the entire table, which
> stopped the cluster crashes, but now it's happening again.
>
> Any idea what's going on?
>
> -eran
>

Re: Cluster crash

Posted by Eran Kutner <er...@gigya.com>.
After 3 days of running with the configuration changes recommended by J-D
the cluster seems stable now.
For the benefit of others I would say there were two issues identified:
First, the HBASE_HEAP was set too high. It turns out that each Hadoop daemon
takes at least 1GB at startup even if it's doing nothing. Since we have a
data node, a task tracker and a thrift server running on each machine, those
take up 3GB of RAM that must be accounted for when allocating memory for the
region server.
Second, we had "-XX:+CMSIncrementalMode" configured, which apparently is not
good with multi-core systems.
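For others hitting this, the two fixes translate into hbase-env.sh roughly as
follows. This is a sketch only: the 16GB node size and the 8000MB heap are
illustrative assumptions, not numbers from this thread.

```sh
# hbase-env.sh -- sketch, assuming a node that also runs a datanode,
# a tasktracker and a thrift server (roughly 1GB each at startup).
# On a hypothetical 16GB machine that leaves ~12GB; keep headroom for
# the OS and size the region server heap well under the remainder:
export HBASE_HEAPSIZE=8000   # in MB; do not size it to the whole machine

# Use plain CMS and drop -XX:+CMSIncrementalMode, which is intended
# for machines with one or two cores and behaves poorly on multi-core hosts:
export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
```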

Thanks J-D for all the help.

-eran



On Mon, Apr 11, 2011 at 23:53, Jean-Daniel Cryans <jd...@apache.org>wrote:

> Alright so I was able to get the logs from Eran, the HDFS errors are a
> red herring, what followed in the region server log that is really
> important is:
>
> 2011-04-10 10:14:27,278 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 144490ms for
> sessionid 0x12ee42283320050, closing socket connection and attempting
> reconnect
>
> Which is a 2m20s GC pause. The HDFS errors come from the fact that the
> master split the logs _while_ the region server was sleeping.
>
> J-D
>
> On Mon, Apr 11, 2011 at 11:47 AM, Jean-Daniel Cryans
> <jd...@apache.org> wrote:
> > So my understanding is that this log file was opened at 7:29 and then
> > something happened at 10:12:55 as something triggered the recovery on
> > that block. It triggered a recovery of the block with the new name
> > being blk_1213779416283711358_54249
> >
> > It seems that that process was started by the DFS Client at 10:12:55
> > but the RS log starts at 10:14. Would it be possible to see what was
> > before that? Also it would be nice to have a view for those blocks on
> > all the datanodes.
> >
> > It would be nice to do this debugging on IRC as it can require a lot
> > of back and forth.
> >
> > J-D
> >
> > On Mon, Apr 11, 2011 at 11:22 AM, Eran Kutner <er...@.com> wrote:
> >> There wasn't an attachment, I pasted all the lines from all the NN logs
> that
> >> contain that particular block number inline.
> >>
> >> As for CPU/IO, first there is nothing else running on those servers,
> second,
> >> CPU utilization on the slaves at peak load was around 40% and disk IO
> >> utilization less than 20%. That's the strange thing about it (I have
> another
> >> thread going about the performance), there is no bottleneck I could
> identify
> >> and yet the performance was relatively low, compared to the numbers I
> see
> >> quoted for HBase in other places.
> >>
> >> The first line of the NN log says:
> >> BLOCK* NameSystem.allocateBlock:
> >> /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/
> hadoop1-s01.farm-ny.gigya.com
> %3A60020.1302434963279.blk_1213779416283711358_54194
> >> So it looks like a file name is:
> >> /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/
> hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279
> >>
> >> Is there a better way to associate a file with a block?
> >>
> >> -eran
> >>
> >>
> >>
> >
>

Re: Cluster crash

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Alright so I was able to get the logs from Eran, the HDFS errors are a
red herring, what followed in the region server log that is really
important is:

2011-04-10 10:14:27,278 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 144490ms for
sessionid 0x12ee42283320050, closing socket connection and attempting
reconnect

Which is a 2m20s GC pause. The HDFS errors come from the fact that the
master split the logs _while_ the region server was sleeping.
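A pause like this is easiest to confirm from the region server's GC log. One
way to enable it (a sketch; the log path is illustrative and the flags assume
a pre-Java-9 HotSpot JVM):

```sh
# In hbase-env.sh: log every collection with timestamps so long
# stop-the-world pauses can be matched against ZooKeeper timeouts.
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
  -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-hbase.log"
```

A pause longer than the ZooKeeper session timeout (here, 144490ms elapsed)
should show up in that log as a full collection around the time of the
session expiry.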

J-D

On Mon, Apr 11, 2011 at 11:47 AM, Jean-Daniel Cryans
<jd...@apache.org> wrote:
> So my understanding is that this log file was opened at 7:29 and then
> something happened at 10:12:55 as something triggered the recovery on
> that block. It triggered a recovery of the block with the new name
> being blk_1213779416283711358_54249
>
> It seems that that process was started by the DFS Client at 10:12:55
> but the RS log starts at 10:14. Would it be possible to see what was
> before that? Also it would be nice to have a view for those blocks on
> all the datanodes.
>
> It would be nice to do this debugging on IRC as it can require a lot
> of back and forth.
>
> J-D
>
> On Mon, Apr 11, 2011 at 11:22 AM, Eran Kutner <er...@.com> wrote:
>> There wasn't an attachment, I pasted all the lines from all the NN logs that
>> contain that particular block number inline.
>>
>> As for CPU/IO, first there is nothing else running on those servers, second,
>> CPU utilization on the slaves at peak load was around 40% and disk IO
>> utilization less than 20%. That's the strange thing about it (I have another
>> thread going about the performance), there is no bottleneck I could identify
>> and yet the performance was relatively low, compared to the numbers I see
>> quoted for HBase in other places.
>>
>> The first line of the NN log says:
>> BLOCK* NameSystem.allocateBlock:
>> /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279.blk_1213779416283711358_54194
>> So it looks like a file name is:
>> /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279
>>
>> Is there a better way to associate a file with a block?
>>
>> -eran
>>
>>
>>
>

Re: Cluster crash

Posted by Jean-Daniel Cryans <jd...@apache.org>.
So my understanding is that this log file was opened at 7:29 and then
something happened at 10:12:55 as something triggered the recovery on
that block. It triggered a recovery of the block with the new name
being blk_1213779416283711358_54249

It seems that that process was started by the DFS Client at 10:12:55
but the RS log starts at 10:14. Would it be possible to see what was
before that? Also it would be nice to have a view for those blocks on
all the datanodes.

It would be nice to do this debugging on IRC as it can require a lot
of back and forth.

J-D

On Mon, Apr 11, 2011 at 11:22 AM, Eran Kutner <er...@.com> wrote:
> There wasn't an attachment, I pasted all the lines from all the NN logs that
> contain that particular block number inline.
>
> As for CPU/IO, first there is nothing else running on those servers, second,
> CPU utilization on the slaves at peak load was around 40% and disk IO
> utilization less than 20%. That's the strange thing about it (I have another
> thread going about the performance), there is no bottleneck I could identify
> and yet the performance was relatively low, compared to the numbers I see
> quoted for HBase in other places.
>
> The first line of the NN log says:
> BLOCK* NameSystem.allocateBlock:
> /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279.blk_1213779416283711358_54194
> So it looks like a file name is:
> /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279
>
> Is there a better way to associate a file with a block?
>
> -eran
>
>
>

Re: Cluster crash

Posted by Eran Kutner <er...@gigya.com>.
There wasn't an attachment, I pasted all the lines from all the NN logs that
contain that particular block number inline.

As for CPU/IO, first there is nothing else running on those servers, second,
CPU utilization on the slaves at peak load was around 40% and disk IO
utilization less than 20%. That's the strange thing about it (I have another
thread going about the performance), there is no bottleneck I could identify
and yet the performance was relatively low, compared to the numbers I see
quoted for HBase in other places.

The first line of the NN log says:
BLOCK* NameSystem.allocateBlock: /hbase/.logs/hadoop1-s01.farm-ny.gigya.com
,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com
%3A60020.1302434963279.blk_1213779416283711358_54194
So it looks like a file name is: /hbase/.logs/hadoop1-s01.farm-ny.gigya.com
,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279

Is there a better way to associate a file with a block?
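One common way, assuming a stock Hadoop fsck (the path below is the WAL
directory taken from the allocateBlock log line above):

```sh
# Print each file under the region server's .logs directory followed by
# its block IDs and the datanodes holding them, so a block reported in
# an error can be traced back to the file it belongs to.
hadoop fsck /hbase/.logs -files -blocks -locations
```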

-eran



On Mon, Apr 11, 2011 at 21:00, Stack <st...@duboce.net> wrote:

> On Sun, Apr 10, 2011 at 11:30 PM, Eran Kutner <er...@gigya.com> wrote:
> > Hi St.Ack and J-D,
> > Thanks for looking into this.
> >
> > It can definitely be a configuration problem, but I seriously doubt it
> > is a network or infrastructure problem. It's infrastructure we operate
> > ourselves (not a cloud), and we have a lot of other services
> > running on it without any problem.
>
> Services that could be stealing i/o and cpu from hbase cluster?  Is
> that possible?
>
> > Note that Hbase is complaining
> > about multiple data nodes (10.1.104.1, 10.1.104.2, 10.1.104.5), I
> > attached the logs from just one of them but it's more or less the same
> > on all. Please see the NN log for the same block below.
> >
>
> An attachment? Did it come through?  Perhaps pastebin it and then add link
> here?
>
>
> > We are using Hadoop 0.20.2-CDH3B4
> > and Hbase Version 0.90.2-SNAPSHOT, rUnknown, Wed Mar 23 06:09:51 EDT
> > 2011 (I built that from the 0.90.2 branch to try to fix a problem with
> > replication)
> >
>
>
>
>
> > </configuration>
> >
>
> Your configuration looks fine.
>
> Can you associate the block with a file?  I don't see the association
> in the below.  I see us trying to delete the block (would like to know
> why -- file deleteion?) and then it does exist on .2 for whatever
> reason.
>
> St.Ack
>
> >
> > This is the NN log for the same block:
> > 2011-04-10 07:29:23,835 INFO org.apache.hadoop.hdfs.StateChange:
> > BLOCK* NameSystem.allocateBlock:
> > /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/
> hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279.
> > blk_1213779416283711358_54194
> > 2011-04-10 10:12:55,749 INFO org.apache.hadoop.hdfs.StateChange:
> > BLOCK* blk_1213779416283711358_54194 recovery started,
> > primary=10.1.104.1:50010
> > 2011-04-10 10:12:58,292 INFO org.apache.hadoop.hdfs.StateChange:
> > BLOCK* NameSystem.addStoredBlock: Targets updated: block
> > blk_1213779416283711358_54249 on 10.1.104.1:50010 is added as a target
> > for block blk_1213779416283711358_54194 with size 162696
> > 2011-04-10 10:12:58,293 INFO org.apache.hadoop.hdfs.StateChange:
> > BLOCK* NameSystem.addStoredBlock: Targets updated: block
> > blk_1213779416283711358_54249 on 10.1.104.5:50010 is added as a target
> > for block blk_1213779416283711358_54194 with size 162696
> > 2011-04-10 10:12:58,294 INFO org.apache.hadoop.hdfs.StateChange:
> > BLOCK* NameSystem.addStoredBlock: Targets updated: block
> > blk_1213779416283711358_54249 on 10.1.104.2:50010 is added as a target
> > for block blk_1213779416283711358_54194 with size 162696
> > 2011-04-10 10:12:58,295 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=54249, newlength=162696,
> > newtargets=[10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010],
> > closeFile=true, deleteBlock=false)
> > 2011-04-10 10:12:58,340 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(newblock=blk_1213779416283711358_54249,
> > file=/hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/
> hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279,
> > newgenerationstamp=54249, newlength=162696,
> > newtargets=[10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010])
> > successful
> > 2011-04-10 10:12:58,773 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54249,
> > newgenerationstamp=54250, newlength=162696,
> > newtargets=[10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010],
> > closeFile=false, deleteBlock=false)
> > 2011-04-10 10:12:58,798 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(blk_1213779416283711358_54250) successful
> > 2011-04-10 10:12:58,833 INFO org.apache.hadoop.hdfs.StateChange:
> > BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.2:50010
> > is added to blk_1213779416283711358_54250 size 162696
> > 2011-04-10 10:12:58,834 INFO org.apache.hadoop.hdfs.StateChange:
> > BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.5:50010
> > is added to blk_1213779416283711358_54250 size 162696
> > 2011-04-10 10:12:58,834 INFO org.apache.hadoop.hdfs.StateChange:
> > BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.1:50010
> > is added to blk_1213779416283711358_54250 size 162696
> > 2011-04-10 10:13:28,277 INFO org.apache.hadoop.hdfs.StateChange:
> > BLOCK* NameSystem.addToInvalidates: blk_1213779416283711358 is added
> > to invalidSet of 10.1.104.2:50010
> > 2011-04-10 10:13:28,277 INFO org.apache.hadoop.hdfs.StateChange:
> > BLOCK* NameSystem.addToInvalidates: blk_1213779416283711358 is added
> > to invalidSet of 10.1.104.5:50010
> > 2011-04-10 10:13:28,277 INFO org.apache.hadoop.hdfs.StateChange:
> > BLOCK* NameSystem.addToInvalidates: blk_1213779416283711358 is added
> > to invalidSet of 10.1.104.1:50010
> > 2011-04-10 10:13:29,811 INFO org.apache.hadoop.hdfs.StateChange:
> > BLOCK* ask 10.1.104.2:50010 to delete  blk_-826484977486799080_54240
> > blk_-3859382357539098123_54247 blk_6239319017569949674_54242
> > blk_2498532690928849052_54243 blk_-4133675075382826161_54234
> > blk_-6299622736350943739_54246 blk_1213779416283711358_54250
> > blk_-7407505773873780456_54244 blk_1070089292787245089_54237
> > 2011-04-10 10:13:32,813 INFO org.apache.hadoop.hdfs.StateChange:
> > BLOCK* ask 10.1.104.1:50010 to delete  blk_-826484977486799080_54240
> > blk_-9184330013088281910_54231 blk_842002398362359267_54239
> > blk_6239319017569949674_54242 blk_-4133675075382826161_54234
> > blk_-2420408100786396293_54245 blk_1213779416283711358_54250
> > blk_1475680055217098335_54241 blk_-8623426430834046353_54236
> > blk_-3859382357539098123_54247 blk_2498532690928849052_54243
> > blk_-6299622736350943739_54246 blk_-7407505773873780456_54244
> > blk_1070089292787245089_54237
> > 2011-04-10 10:13:32,813 INFO org.apache.hadoop.hdfs.StateChange:
> > BLOCK* ask 10.1.104.5:50010 to delete  blk_-8623426430834046353_54236
> > blk_-3859382357539098123_54247 blk_-2420408100786396293_54245
> > blk_-4133675075382826161_54234 blk_1213779416283711358_54250
> > 2011-04-10 10:14:29,820 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
> > deleteBlock=true)
> > 2011-04-10 10:14:29,820 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 8 on 8020, call
> > commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
> > true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@b8682c8) from
> > 10.1.104.2:46012: error: java.io.IOException: Block
> > (=blk_1213779416283711358_54194) not found
> > java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> > 2011-04-10 10:14:29,835 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
> > deleteBlock=true)
> > 2011-04-10 10:14:29,836 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 6 on 8020, call
> > commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
> > true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@2bf41e61) from
> > 10.1.104.2:46012: error: java.io.IOException: Block
> > (=blk_1213779416283711358_54194) not found
> > java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> > 2011-04-10 10:14:30,862 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
> > deleteBlock=true)
> > 2011-04-10 10:14:30,862 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 2 on 8020, call
> > commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
> > true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@27f688d2) from
> > 10.1.104.2:46012: error: java.io.IOException: Block
> > (=blk_1213779416283711358_54194) not found
> > java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> > 2011-04-10 10:14:31,888 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
> > deleteBlock=true)
> > 2011-04-10 10:14:31,888 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 5 on 8020, call
> > commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
> > true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@2a90a77d) from
> > 10.1.104.2:46012: error: java.io.IOException: Block
> > (=blk_1213779416283711358_54194) not found
> > java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> > 2011-04-10 10:14:32,913 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
> > deleteBlock=true)
> > 2011-04-10 10:14:32,913 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 0 on 8020, call
> > commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
> > true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@7ee4155f) from
> > 10.1.104.2:46012: error: java.io.IOException: Block
> > (=blk_1213779416283711358_54194) not found
> > java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> > 2011-04-10 10:14:33,943 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
> > deleteBlock=true)
> > 2011-04-10 10:14:33,943 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 3 on 8020, call
> > commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
> > true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@3037510e) from
> > 10.1.104.2:46012: error: java.io.IOException: Block
> > (=blk_1213779416283711358_54194) not found
> > java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> > 2011-04-10 10:14:34,968 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
> > deleteBlock=true)
> > 2011-04-10 10:14:34,968 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 9 on 8020, call
> > commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
> > true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@7faa86dc) from
> > 10.1.104.1:42703: error: java.io.IOException: Block
> > (=blk_1213779416283711358_54194) not found
> > java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> > 2011-04-10 10:14:35,992 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
> > deleteBlock=true)
> > 2011-04-10 10:14:35,993 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 6 on 8020, call
> > commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
> > true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@147c16ff) from
> > 10.1.104.1:42703: error: java.io.IOException: Block
> > (=blk_1213779416283711358_54194) not found
> > java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> > 2011-04-10 10:14:37,005 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
> > deleteBlock=true)
> > 2011-04-10 10:14:37,005 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 9 on 8020, call
> > commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
> > true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@5ec9c9da) from
> > 10.1.104.1:42703: error: java.io.IOException: Block
> > (=blk_1213779416283711358_54194) not found
> > java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> > 2011-04-10 10:14:38,016 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
> > deleteBlock=true)
> > 2011-04-10 10:14:38,016 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 5 on 8020, call
> > commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
> > true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@5c664878) from
> > 10.1.104.1:42703: error: java.io.IOException: Block
> > (=blk_1213779416283711358_54194) not found
> > java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> > 2011-04-10 10:14:39,029 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
> > deleteBlock=true)
> > 2011-04-10 10:14:39,029 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 2 on 8020, call
> > commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
> > true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@3f1fedac) from
> > 10.1.104.1:42703: error: java.io.IOException: Block
> > (=blk_1213779416283711358_54194) not found
> > java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> > 2011-04-10 10:14:40,039 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
> > deleteBlock=true)
> > 2011-04-10 10:14:40,039 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 2 on 8020, call
> > commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
> > true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@1efd5b14) from
> > 10.1.104.1:42703: error: java.io.IOException: Block
> > (=blk_1213779416283711358_54194) not found
> > java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> > 2011-04-10 10:14:41,050 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
> > deleteBlock=true)
> > 2011-04-10 10:14:41,050 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 3 on 8020, call
> > commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
> > true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@b668df1) from
> > 10.1.104.5:54533: error: java.io.IOException: Block
> > (=blk_1213779416283711358_54194) not found
> > java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> > 2011-04-10 10:14:42,060 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
> > deleteBlock=true)
> > 2011-04-10 10:14:42,060 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 6 on 8020, call
> > commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
> > true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@79f0cf0f) from
> > 10.1.104.5:54533: error: java.io.IOException: Block
> > (=blk_1213779416283711358_54194) not found
> > java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> > 2011-04-10 10:14:43,069 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
> > deleteBlock=true)
> > 2011-04-10 10:14:43,069 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 0 on 8020, call
> > commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
> > true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@30aa2054) from
> > 10.1.104.5:54533: error: java.io.IOException: Block
> > (=blk_1213779416283711358_54194) not found
> > java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> > 2011-04-10 10:14:44,078 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
> > deleteBlock=true)
> > 2011-04-10 10:14:44,078 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 8 on 8020, call
> > commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
> > true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@713083c1) from
> > 10.1.104.5:54533: error: java.io.IOException: Block
> > (=blk_1213779416283711358_54194) not found
> > java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> > 2011-04-10 10:14:45,088 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
> > deleteBlock=true)
> > 2011-04-10 10:14:45,088 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 8 on 8020, call
> > commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
> > true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@4ebf4464) from
> > 10.1.104.5:54533: error: java.io.IOException: Block
> > (=blk_1213779416283711358_54194) not found
> > java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> > 2011-04-10 10:14:46,098 INFO
> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
> > newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
> > deleteBlock=true)
> > 2011-04-10 10:14:46,098 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 5 on 8020, call
> > commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
> > true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@6f23299f) from
> > 10.1.104.5:54533: error: java.io.IOException: Block
> > (=blk_1213779416283711358_54194) not found
> > java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> >
> >
> >
> > -eran
> >
> >
> >
> >
> > On Sun, Apr 10, 2011 at 22:27, Jean-Daniel Cryans <jd...@apache.org>
> wrote:
> >> My experience in debugging those kind of issues is that 95% of the
> >> time it's a configuration issue, 4.99% of the time it's environment
> >> and network issues (network splits, lost packets, etc), and the
> >> remaining 0.01% is actual HDFS issues.
> >>
> >> The fact that you're saying that you had issues even with no load
> >> makes me think it's a configuration issue. Can we see your hdfs
> >> config?
> >>
> >> BTW the HBase log was pointing at 10.1.104.1 as the one having an
> >> issue, is that the log we are looking at? (it doesn't seem so)
> >>
> >> Thx,
> >>
> >> J-D
> >>
> >> On Sun, Apr 10, 2011 at 12:05 PM, Eran Kutner <er...@gaigya.com> wrote:
> >>> This is how it's configured in /etc/security/limits.con on all the
> >>> slaves in the cluster:
> >>> hadoop           -       nofile          32768
> >>> hdfs             -       nofile          32768
> >>> hbase            -       nofile          32768
> >>> hadoop           -       nproc           32000
> >>> hdfs           -       nproc           32000
> >>> hbase           -       nproc           32000
> >>>
> >>> When hbase is loading it prints:
> >>> ulimit -n 32768
> >>>
> >>>
> >>> -eran
> >>
> >
>

Re: Cluster crash

Posted by Stack <st...@duboce.net>.
On Sun, Apr 10, 2011 at 11:30 PM, Eran Kutner <er...@gigya.com> wrote:
> Hi St.Ack and J-D,
> Thanks for looking into this.
>
> It can definitely be a configuration problem, but I seriously doubt it
> is a network or infrastructure problem. It's infrastructure we operate
> ourselves (not a cloud), and we have a lot of other services
> running on it without any problem.

Services that could be stealing i/o and cpu from hbase cluster?  Is
that possible?

> Note that Hbase is complaining
> about multiple data nodes (10.1.104.1, 10.1.104.2, 10.1.104.5), I
> attached the logs from just one of them but it's more or less the same
> on all. Please see the NN log for the same block below.
>

An attachment? Did it come through?  Perhaps pastebin it and then add link here?


> We are using Hadoop 0.20.2-CDH3B4
> and HBase version 0.90.2-SNAPSHOT, rUnknown, Wed Mar 23 06:09:51 EDT
> 2011 (I built that from the 0.90.2 branch to try to fix a problem with
> replication)
>




> </configuration>
>

Your configuration looks fine.

Can you associate the block with a file? I don't see the association
in the log below. I see us trying to delete the block (I'd like to
know why -- a file deletion?) and then it does exist on .2 for
whatever reason.
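One way to make that association (a sketch; assumes the Hadoop 0.20-era CLI is on the PATH of a cluster node) is fsck, which lists every file together with its block IDs:

```shell
# Sketch: map a block ID back to the file that owns it. Grep for the
# block without its generation stamp, since the stamp changes on
# recovery. Guarded so it degrades gracefully off-cluster.
BLOCK="blk_1213779416283711358"
if command -v hadoop >/dev/null 2>&1; then
  hadoop fsck / -files -blocks -locations 2>/dev/null | grep -B1 "$BLOCK" || true
else
  echo "hadoop CLI not found; run this on a cluster node"
fi
```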

St.Ack

>
> This is the NN log for the same block:
> [NameNode log for blk_1213779416283711358_54194 quoted in full in
> Eran's message below; snipped]
>
>
>
> -eran
>
>
>
>
> On Sun, Apr 10, 2011 at 22:27, Jean-Daniel Cryans <jd...@apache.org> wrote:
>> [J-D's reply and the limits.conf settings are quoted in full earlier
>> in the thread; snipped]
>

Re: Cluster crash

Posted by Eran Kutner <er...@gigya.com>.
Hi St.Ack and J-D,
Thanks for looking into this.

It can definitely be a configuration problem, but I seriously doubt it
is a network or infrastructure problem. It's infrastructure we operate
ourselves (not a cloud), and we run a lot of other services on it
without any problems. Note that HBase is complaining about multiple
data nodes (10.1.104.1, 10.1.104.2, 10.1.104.5); I attached the logs
from just one of them, but it's more or less the same on all. Please
see the NN log for the same block below.

We are using Hadoop 0.20.2-CDH3B4
and HBase version 0.90.2-SNAPSHOT, rUnknown, Wed Mar 23 06:09:51 EDT
2011 (I built that from the 0.90.2 branch to try to fix a problem with
replication)



This is the HDFS config file:
# cat /etc/hadoop/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
     <name>dfs.permissions</name>
     <value>false</value>
  </property>
  <property>
     <!-- specify this so that running 'hadoop namenode -format'
formats the right dir -->
     <name>dfs.name.dir</name>
     <value>/hadoop/dfs/name,/hadoop/dfs/name2</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/hadoop/dfs/data/</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/hadoop/dfs/namesecondary/</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
</configuration>
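For what it's worth, the HBase-relevant knobs in a config like the one above can be sanity-checked mechanically. A minimal sketch (the 4096 floor for dfs.datanode.max.xcievers is the commonly recommended minimum for HBase 0.90, and 256 is the HDFS default of that era; neither value comes from this thread):

```python
# Sketch: parse a Hadoop *-site.xml and flag HBase-unfriendly values.
import xml.etree.ElementTree as ET

def hdfs_props(xml_text):
    """Return {property name: value} from a Hadoop *-site.xml string."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}

def check(props):
    """Return a list of settings that commonly bite HBase clusters."""
    problems = []
    # HDFS ships with a low transceiver cap (256); HBase docs of this
    # era recommend raising it to at least 4096.
    if int(props.get("dfs.datanode.max.xcievers", "256")) < 4096:
        problems.append("dfs.datanode.max.xcievers < 4096")
    if int(props.get("dfs.replication", "3")) < 2:
        problems.append("dfs.replication < 2")
    return problems

conf = """<configuration>
  <property><name>dfs.replication</name><value>3</value></property>
  <property><name>dfs.datanode.max.xcievers</name><value>4096</value></property>
</configuration>"""
print(check(hdfs_props(conf)))  # prints []
```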


This is the NN log for the same block:
2011-04-10 07:29:23,835 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* NameSystem.allocateBlock:
/hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279.
blk_1213779416283711358_54194
2011-04-10 10:12:55,749 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* blk_1213779416283711358_54194 recovery started,
primary=10.1.104.1:50010
2011-04-10 10:12:58,292 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* NameSystem.addStoredBlock: Targets updated: block
blk_1213779416283711358_54249 on 10.1.104.1:50010 is added as a target
for block blk_1213779416283711358_54194 with size 162696
2011-04-10 10:12:58,293 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* NameSystem.addStoredBlock: Targets updated: block
blk_1213779416283711358_54249 on 10.1.104.5:50010 is added as a target
for block blk_1213779416283711358_54194 with size 162696
2011-04-10 10:12:58,294 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* NameSystem.addStoredBlock: Targets updated: block
blk_1213779416283711358_54249 on 10.1.104.2:50010 is added as a target
for block blk_1213779416283711358_54194 with size 162696
2011-04-10 10:12:58,295 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=54249, newlength=162696,
newtargets=[10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010],
closeFile=true, deleteBlock=false)
2011-04-10 10:12:58,340 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(newblock=blk_1213779416283711358_54249,
file=/hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279,
newgenerationstamp=54249, newlength=162696,
newtargets=[10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010])
successful
2011-04-10 10:12:58,773 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54249,
newgenerationstamp=54250, newlength=162696,
newtargets=[10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010],
closeFile=false, deleteBlock=false)
2011-04-10 10:12:58,798 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(blk_1213779416283711358_54250) successful
2011-04-10 10:12:58,833 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.2:50010
is added to blk_1213779416283711358_54250 size 162696
2011-04-10 10:12:58,834 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.5:50010
is added to blk_1213779416283711358_54250 size 162696
2011-04-10 10:12:58,834 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.1:50010
is added to blk_1213779416283711358_54250 size 162696
2011-04-10 10:13:28,277 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* NameSystem.addToInvalidates: blk_1213779416283711358 is added
to invalidSet of 10.1.104.2:50010
2011-04-10 10:13:28,277 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* NameSystem.addToInvalidates: blk_1213779416283711358 is added
to invalidSet of 10.1.104.5:50010
2011-04-10 10:13:28,277 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* NameSystem.addToInvalidates: blk_1213779416283711358 is added
to invalidSet of 10.1.104.1:50010
2011-04-10 10:13:29,811 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* ask 10.1.104.2:50010 to delete  blk_-826484977486799080_54240
blk_-3859382357539098123_54247 blk_6239319017569949674_54242
blk_2498532690928849052_54243 blk_-4133675075382826161_54234
blk_-6299622736350943739_54246 blk_1213779416283711358_54250
blk_-7407505773873780456_54244 blk_1070089292787245089_54237
2011-04-10 10:13:32,813 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* ask 10.1.104.1:50010 to delete  blk_-826484977486799080_54240
blk_-9184330013088281910_54231 blk_842002398362359267_54239
blk_6239319017569949674_54242 blk_-4133675075382826161_54234
blk_-2420408100786396293_54245 blk_1213779416283711358_54250
blk_1475680055217098335_54241 blk_-8623426430834046353_54236
blk_-3859382357539098123_54247 blk_2498532690928849052_54243
blk_-6299622736350943739_54246 blk_-7407505773873780456_54244
blk_1070089292787245089_54237
2011-04-10 10:13:32,813 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* ask 10.1.104.5:50010 to delete  blk_-8623426430834046353_54236
blk_-3859382357539098123_54247 blk_-2420408100786396293_54245
blk_-4133675075382826161_54234 blk_1213779416283711358_54250
2011-04-10 10:14:29,820 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
deleteBlock=true)
2011-04-10 10:14:29,820 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 8 on 8020, call
commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@b8682c8) from
10.1.104.2:46012: error: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
2011-04-10 10:14:29,835 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
deleteBlock=true)
2011-04-10 10:14:29,836 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 6 on 8020, call
commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@2bf41e61) from
10.1.104.2:46012: error: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
2011-04-10 10:14:30,862 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
deleteBlock=true)
2011-04-10 10:14:30,862 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 8020, call
commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@27f688d2) from
10.1.104.2:46012: error: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
2011-04-10 10:14:31,888 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
deleteBlock=true)
2011-04-10 10:14:31,888 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 5 on 8020, call
commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@2a90a77d) from
10.1.104.2:46012: error: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
2011-04-10 10:14:32,913 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
deleteBlock=true)
2011-04-10 10:14:32,913 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 8020, call
commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@7ee4155f) from
10.1.104.2:46012: error: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
2011-04-10 10:14:33,943 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
deleteBlock=true)
2011-04-10 10:14:33,943 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 3 on 8020, call
commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@3037510e) from
10.1.104.2:46012: error: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
2011-04-10 10:14:34,968 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
deleteBlock=true)
2011-04-10 10:14:34,968 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 9 on 8020, call
commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@7faa86dc) from
10.1.104.1:42703: error: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
2011-04-10 10:14:35,992 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
deleteBlock=true)
2011-04-10 10:14:35,993 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 6 on 8020, call
commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@147c16ff) from
10.1.104.1:42703: error: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
2011-04-10 10:14:37,005 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
deleteBlock=true)
2011-04-10 10:14:37,005 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 9 on 8020, call
commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@5ec9c9da) from
10.1.104.1:42703: error: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
2011-04-10 10:14:38,016 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
deleteBlock=true)
2011-04-10 10:14:38,016 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 5 on 8020, call
commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@5c664878) from
10.1.104.1:42703: error: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
2011-04-10 10:14:39,029 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
deleteBlock=true)
2011-04-10 10:14:39,029 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 8020, call
commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@3f1fedac) from
10.1.104.1:42703: error: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
2011-04-10 10:14:40,039 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
deleteBlock=true)
2011-04-10 10:14:40,039 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 8020, call
commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@1efd5b14) from
10.1.104.1:42703: error: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
2011-04-10 10:14:41,050 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
deleteBlock=true)
2011-04-10 10:14:41,050 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 3 on 8020, call
commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@b668df1) from
10.1.104.5:54533: error: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
2011-04-10 10:14:42,060 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
deleteBlock=true)
2011-04-10 10:14:42,060 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 6 on 8020, call
commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@79f0cf0f) from
10.1.104.5:54533: error: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
2011-04-10 10:14:43,069 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
deleteBlock=true)
2011-04-10 10:14:43,069 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 8020, call
commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@30aa2054) from
10.1.104.5:54533: error: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
2011-04-10 10:14:44,078 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
deleteBlock=true)
2011-04-10 10:14:44,078 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 8 on 8020, call
commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@713083c1) from
10.1.104.5:54533: error: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
2011-04-10 10:14:45,088 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
deleteBlock=true)
2011-04-10 10:14:45,088 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 8 on 8020, call
commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@4ebf4464) from
10.1.104.5:54533: error: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
java.io.IOException: Block (=blk_1213779416283711358_54194) not found
2011-04-10 10:14:46,098 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_1213779416283711358_54194,
newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
deleteBlock=true)
2011-04-10 10:14:46,098 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 5 on 8020, call
commitBlockSynchronization(blk_1213779416283711358_54194, 0, 0, false,
true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@6f23299f) from
10.1.104.5:54533: error: java.io.IOException: Block
(=blk_1213779416283711358_54194) not found
java.io.IOException: Block (=blk_1213779416283711358_54194) not found



-eran




On Sun, Apr 10, 2011 at 22:27, Jean-Daniel Cryans <jd...@apache.org> wrote:
> My experience in debugging those kinds of issues is that 95% of the
> time it's a configuration issue, 4.99% of the time it's environment
> and network issues (network splits, lost packets, etc), and the
> remaining 0.01% is actual HDFS issues.
>
> The fact that you're saying that you had issues even with no load
> makes me think it's a configuration issue. Can we see your hdfs
> config?
>
> BTW the HBase log was pointing at 10.1.104.1 as the one having an
> issue, is that the log we are looking at? (it doesn't seem so)
>
> Thx,
>
> J-D
>
> On Sun, Apr 10, 2011 at 12:05 PM, Eran Kutner <er...@gigya.com> wrote:
>> This is how it's configured in /etc/security/limits.conf on all the
>> slaves in the cluster:
>> hadoop           -       nofile          32768
>> hdfs             -       nofile          32768
>> hbase            -       nofile          32768
>> hadoop           -       nproc           32000
>> hdfs             -       nproc           32000
>> hbase            -       nproc           32000
>>
>> When hbase is loading it prints:
>> ulimit -n 32768
>>
>>
>> -eran
>

Re: Cluster crash

Posted by Jean-Daniel Cryans <jd...@apache.org>.
My experience in debugging those kinds of issues is that 95% of the
time it's a configuration issue, 4.99% of the time it's environment
and network issues (network splits, lost packets, etc), and the
remaining 0.01% is actual HDFS issues.

The fact that you're saying that you had issues even with no load
makes me think it's a configuration issue. Can we see your hdfs
config?
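
The relevant settings can be pulled straight out of hdfs-site.xml. The sketch below greps them from an inline sample file; the property names and values shown (dfs.datanode.max.xcievers, the socket write timeout) are common suspects for HBase-under-load crashes of this era, not values confirmed anywhere in this thread. Point the grep at the real config on a datanode instead.

```shell
# Sketch: grep the suspect settings out of hdfs-site.xml. The sample
# config written below is illustrative only -- these property names are
# common suspects, not values taken from this thread.
cat <<'EOF' > /tmp/hdfs-site-sample.xml
<configuration>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
  </property>
</configuration>
EOF
# Print each property name with its value on the following line;
# on a real node substitute something like /etc/hadoop/conf/hdfs-site.xml.
grep -A1 '<name>' /tmp/hdfs-site-sample.xml
```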

BTW the HBase log was pointing at 10.1.104.1 as the one having an
issue, is that the log we are looking at? (it doesn't seem so)

Thx,

J-D

On Sun, Apr 10, 2011 at 12:05 PM, Eran Kutner <er...@gigya.com> wrote:
> This is how it's configured in /etc/security/limits.conf on all the
> slaves in the cluster:
> hadoop           -       nofile          32768
> hdfs             -       nofile          32768
> hbase            -       nofile          32768
> hadoop           -       nproc           32000
> hdfs             -       nproc           32000
> hbase            -       nproc           32000
>
> When hbase is loading it prints:
> ulimit -n 32768
>
>
> -eran

Re: Cluster crash

Posted by Eran Kutner <er...@gigya.com>.
This is how it's configured in /etc/security/limits.conf on all the
slaves in the cluster:
hadoop           -       nofile          32768
hdfs             -       nofile          32768
hbase            -       nofile          32768
hadoop           -       nproc           32000
hdfs             -       nproc           32000
hbase            -       nproc           32000

When hbase is loading it prints:
ulimit -n 32768
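
Note that /etc/security/limits.conf entries only apply to sessions started after the change, so the surest check is what a running daemon actually got, via /proc. The sketch below inspects the current shell as a stand-in; on a real node substitute the pid of a DataNode or RegionServer.

```shell
# Sketch: read the effective limits of a live process from /proc.
# $$ (this shell) stands in for a DataNode/RegionServer pid here.
grep -E 'Max (open files|processes)' /proc/$$/limits
```

If this prints the defaults (often 1024 open files), the limits.conf change never reached the daemon, which is exactly the ulimit issue Stack asks about.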


-eran




On Sun, Apr 10, 2011 at 21:54, Stack <st...@duboce.net> wrote:
> Did you read the requirements section [1] and verify that ulimit and
> nproc for the user who owns the hbase and hadoop processes have
> indeed been raised?
>
> Yours,
> St.Ack
>
> 1. http://hbase.apache.org/book/notsoquick.html#requirements
>
> On Sun, Apr 10, 2011 at 8:07 AM, Eran Kutner <er...@gigya.com> wrote:
>> Hi,
>> While doing load testing on HBase the entire cluster crashed with
>> errors like these in hbase logs:
>>
>> 2011-04-10 10:14:30,844 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_1213779416283711358_54194 bad datanode[0]
>> 10.1.104.1:50010
>> 2011-04-10 10:14:30,844 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_1213779416283711358_54194 in pipeline
>> 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010: bad datanode
>> 10.1.104.1:50010
>> 2011-04-10 10:14:30,880 WARN org.apache.hadoop.hdfs.DFSClient: Failed
>> recovery attempt #2 from primary datanode 10.1.104.2:50010
>> org.apache.hadoop.ipc.RemoteException:
>> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
>> (=blk_1213779416283711358_54194) not found
>>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:2099)
>>        at org.apache.hadoop.hdfs.server.namenode.NameNode.commitBlockSynchronization(NameNode.java:703)
>>        at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416)
>>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412)
>>        at java.security.AccessController.doPrivileged(Native Method)
>>        at javax.security.auth.Subject.doAs(Subject.java:396)
>>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410)
>>
>>        at org.apache.hadoop.ipc.Client.call(Client.java:1104)
>>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>>        at $Proxy4.commitBlockSynchronization(Unknown Source)
>>        at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1847)
>>        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1828)
>>        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1924)
>>        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416)
>>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412)
>>        at java.security.AccessController.doPrivileged(Native Method)
>>        at javax.security.auth.Subject.doAs(Subject.java:396)
>>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410)
>>
>>        at org.apache.hadoop.ipc.Client.call(Client.java:1104)
>>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>>        at $Proxy8.recoverBlock(Unknown Source)
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2785)
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2289)
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2461)
>> 2011-04-10 10:14:30,880 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_1213779416283711358_54194 failed  because
>> recovery from primary datanode 10.1.104.2:50010 failed 3 times.
>> Pipeline was 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010.
>> Will retry...
>> 2011-04-10 10:14:31,882 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_1213779416283711358_54194 bad datanode[0]
>> 10.1.104.1:50010
>> 2011-04-10 10:14:31,882 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_1213779416283711358_54194 in pipeline
>> 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010: bad datanode
>> 10.1.104.1:50010
>> 2011-04-10 10:14:31,905 WARN org.apache.hadoop.hdfs.DFSClient: Failed
>> recovery attempt #3 from primary datanode 10.1.104.2:50010
>> org.apache.hadoop.ipc.RemoteException:
>> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
>> (=blk_1213779416283711358_54194) not found
>>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:2099)
>>        at org.apache.hadoop.hdfs.server.namenode.NameNode.commitBlockSynchronization(NameNode.java:703)
>>        at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416)
>>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412)
>>        at java.security.AccessController.doPrivileged(Native Method)
>>        at javax.security.auth.Subject.doAs(Subject.java:396)
>>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410)
>>
>>        at org.apache.hadoop.ipc.Client.call(Client.java:1104)
>>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>>        at $Proxy4.commitBlockSynchronization(Unknown Source)
>>        at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1847)
>>        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1828)
>>        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1924)
>>        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416)
>>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412)
>>        at java.security.AccessController.doPrivileged(Native Method)
>>        at javax.security.auth.Subject.doAs(Subject.java:396)
>>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410)
>>
>>        at org.apache.hadoop.ipc.Client.call(Client.java:1104)
>>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>>        at $Proxy8.recoverBlock(Unknown Source)
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2785)
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2289)
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2461)
>> 2011-04-10 10:14:31,905 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_1213779416283711358_54194 failed  because
>> recovery from primary datanode 10.1.104.2:50010 failed 4 times.
>> Pipeline was 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010.
>> Will retry...
>>
>> The Datanode log for this block ID shows:
>> # cat  /var/log/hadoop/hadoop-hadoop-datanode-hadoop1-s02.log | grep
>> 1213779416283711358
>> 2011-04-10 07:29:23,858 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>> blk_1213779416283711358_54194 src: /10.1.104.5:40533 dest:
>> /10.1.104.2:50010
>> 2011-04-10 10:12:58,242 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
>> receiveBlock for block blk_1213779416283711358_54194
>> java.io.EOFException: while trying to read 871 bytes
>> 2011-04-10 10:12:58,242 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
>> block blk_1213779416283711358_54194 Interrupted.
>> 2011-04-10 10:12:58,242 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
>> block blk_1213779416283711358_54194 terminating
>> 2011-04-10 10:12:58,242 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
>> blk_1213779416283711358_54194 received exception java.io.EOFException:
>> while trying to read 871 bytes
>> 2011-04-10 10:12:58,288 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> oldblock=blk_1213779416283711358_54194(length=162696),
>> newblock=blk_1213779416283711358_54249(length=162696),
>> datanode=10.1.104.2:50010
>> 2011-04-10 10:12:58,288 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
>> blk_1213779416283711358_54249 of size 162696 as part of lease
>> recovery.
>> 2011-04-10 10:12:58,766 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> oldblock=blk_1213779416283711358_54249(length=162696),
>> newblock=blk_1213779416283711358_54250(length=162696),
>> datanode=10.1.104.2:50010
>> 2011-04-10 10:12:58,803 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>> blk_1213779416283711358_54250 src: /10.1.104.5:33159 dest:
>> /10.1.104.2:50010
>> 2011-04-10 10:12:58,805 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Reopen Block for
>> append blk_1213779416283711358_54250
>> 2011-04-10 10:12:58,827 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: setBlockPosition
>> trying to set position to 162696 for block
>> blk_1213779416283711358_54250 which is not a multiple of
>> bytesPerChecksum 512
>> 2011-04-10 10:12:58,827 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> computePartialChunkCrc sizePartialChunk 392 block
>> blk_1213779416283711358_54250 offset in block 162304 offset in
>> metafile 1275
>> 2011-04-10 10:12:58,827 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Read in partial CRC
>> chunk from disk for block blk_1213779416283711358_54250
>> 2011-04-10 10:12:58,828 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>> /10.1.104.5:33159, dest: /10.1.104.2:50010, bytes: 162696, op:
>> HDFS_WRITE, cliID:
>> DFSClient_hb_m_hadoop1-hbm1.farm-ny.gigya.com:60000_1301599306867,
>> offset: 0, srvID: DS-1554836204-10.1.104.2-50010-1298986316425,
>> blockid: blk_1213779416283711358_54250, duration: 21639000
>> 2011-04-10 10:12:58,828 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
>> block blk_1213779416283711358_54250 terminating
>> 2011-04-10 10:13:31,080 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
>> blk_1213779416283711358_54250 file
>> /hadoop/dfs/data/current/subdir61/subdir17/blk_1213779416283711358 for
>> deletion
>> 2011-04-10 10:13:31,250 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
>> blk_1213779416283711358_54250 at file
>> /hadoop/dfs/data/current/subdir61/subdir17/blk_1213779416283711358
>> 2011-04-10 10:14:29,811 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
>> recoverBlock(block=blk_1213779416283711358_54194,
>> targets=[10.1.104.5:50010, 10.1.104.2:50010])
>> 2011-04-10 10:14:29,814 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
>> found for block blk_1213779416283711358_54194 on datanode
>> 10.1.104.5:50010
>> 2011-04-10 10:14:29,814 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
>> found for block blk_1213779416283711358_54194 on datanode
>> 10.1.104.2:50010
>> 2011-04-10 10:14:29,815 INFO org.apache.hadoop.ipc.Server: IPC Server
>> handler 0 on 50020, call recoverBlock(blk_1213779416283711358_54194,
>> false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@c754900) from
>> 10.1.104.1:34481: error: org.apache.hadoop.ipc.RemoteException:
>> java.io.IOException: Block (=blk_1213779416283711358_54194) not found
>> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
>> (=blk_1213779416283711358_54194) not found
>> 2011-04-10 10:14:29,828 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
>> recoverBlock(block=blk_1213779416283711358_54194,
>> targets=[10.1.104.5:50010, 10.1.104.2:50010])
>> 2011-04-10 10:14:29,830 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
>> found for block blk_1213779416283711358_54194 on datanode
>> 10.1.104.5:50010
>> 2011-04-10 10:14:29,830 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
>> found for block blk_1213779416283711358_54194 on datanode
>> 10.1.104.2:50010
>> 2011-04-10 10:14:29,831 INFO org.apache.hadoop.ipc.Server: IPC Server
>> handler 1 on 50020, call recoverBlock(blk_1213779416283711358_54194,
>> false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@516485d9) from
>> 10.1.104.1:34489: error: org.apache.hadoop.ipc.RemoteException:
>> java.io.IOException: Block (=blk_1213779416283711358_54194) not found
>> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
>> (=blk_1213779416283711358_54194) not found
>> 2011-04-10 10:14:30,854 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
>> recoverBlock(block=blk_1213779416283711358_54194,
>> targets=[10.1.104.5:50010, 10.1.104.2:50010])
>> 2011-04-10 10:14:30,856 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
>> found for block blk_1213779416283711358_54194 on datanode
>> 10.1.104.5:50010
>> 2011-04-10 10:14:30,856 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
>> found for block blk_1213779416283711358_54194 on datanode
>> 10.1.104.2:50010
>> 2011-04-10 10:14:30,857 INFO org.apache.hadoop.ipc.Server: IPC Server
>> handler 2 on 50020, call recoverBlock(blk_1213779416283711358_54194,
>> false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@62ee365a) from
>> 10.1.104.1:34499: error: org.apache.hadoop.ipc.RemoteException:
>> java.io.IOException: Block (=blk_1213779416283711358_54194) not found
>> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
>> (=blk_1213779416283711358_54194) not found
>> 2011-04-10 10:14:31,879 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
>> recoverBlock(block=blk_1213779416283711358_54194,
>> targets=[10.1.104.5:50010, 10.1.104.2:50010])
>> 2011-04-10 10:14:31,882 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
>> found for block blk_1213779416283711358_54194 on datanode
>> 10.1.104.5:50010
>> 2011-04-10 10:14:31,882 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
>> found for block blk_1213779416283711358_54194 on datanode
>> 10.1.104.2:50010
>> 2011-04-10 10:14:31,883 INFO org.apache.hadoop.ipc.Server: IPC Server
>> handler 0 on 50020, call recoverBlock(blk_1213779416283711358_54194,
>> false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@ed5547f) from
>> 10.1.104.1:34500: error: org.apache.hadoop.ipc.RemoteException:
>> java.io.IOException: Block (=blk_1213779416283711358_54194) not found
>> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
>> (=blk_1213779416283711358_54194) not found
>> 2011-04-10 10:14:32,905 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
>> recoverBlock(block=blk_1213779416283711358_54194,
>> targets=[10.1.104.5:50010, 10.1.104.2:50010])
>> 2011-04-10 10:14:32,907 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
>> found for block blk_1213779416283711358_54194 on datanode
>> 10.1.104.5:50010
>> 2011-04-10 10:14:32,907 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
>> found for block blk_1213779416283711358_54194 on datanode
>> 10.1.104.2:50010
>> 2011-04-10 10:14:32,908 INFO org.apache.hadoop.ipc.Server: IPC Server
>> handler 1 on 50020, call recoverBlock(blk_1213779416283711358_54194,
>> false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@2dfd9614) from
>> 10.1.104.1:34511: error: org.apache.hadoop.ipc.RemoteException:
>> java.io.IOException: Block (=blk_1213779416283711358_54194) not found
>> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
>> (=blk_1213779416283711358_54194) not found
>> 2011-04-10 10:14:33,935 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
>> recoverBlock(block=blk_1213779416283711358_54194,
>> targets=[10.1.104.5:50010, 10.1.104.2:50010])
>> 2011-04-10 10:14:33,937 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
>> found for block blk_1213779416283711358_54194 on datanode
>> 10.1.104.5:50010
>> 2011-04-10 10:14:33,937 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
>> found for block blk_1213779416283711358_54194 on datanode
>> 10.1.104.2:50010
>> 2011-04-10 10:14:33,938 INFO org.apache.hadoop.ipc.Server: IPC Server
>> handler 2 on 50020, call recoverBlock(blk_1213779416283711358_54194,
>> false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@22fe47f3) from
>> 10.1.104.1:34522: error: org.apache.hadoop.ipc.RemoteException:
>> java.io.IOException: Block (=blk_1213779416283711358_54194) not found
>> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
>> (=blk_1213779416283711358_54194) not found
>>
>>
>>
>> Running "hadoop fsck /" says the filesystem is healthy.
>>
>> The same kind of errors happened on all the region servers, but for
>> different blocks and against different datanodes; eventually the entire
>> cluster shut itself down. This is the second time I'm having this
>> problem while running load tests on the system. The previous time, once
>> it started happening it would recur every hour, even with no load
>> running on the cluster, presumably during compaction. I wasn't sure
>> if it was somehow my fault, so I truncated the entire table, which
>> stopped the cluster crashes, but now it's happening again.
>>
>> Any idea what's going on?
>>
>> -eran
>>
>

Re: Cluster crash

Posted by Stack <st...@duboce.net>.
Did you read the requirements section [1] and verify that ulimit and
nproc for the user who owns the hbase and hadoop processes have
indeed been raised?

Yours,
St.Ack

1. http://hbase.apache.org/book/notsoquick.html#requirements

On Sun, Apr 10, 2011 at 8:07 AM, Eran Kutner <er...@gigya.com> wrote:
> Hi,
> While doing load testing on HBase the entire cluster crashed with
> errors like these in hbase logs:
>
> 2011-04-10 10:14:30,844 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_1213779416283711358_54194 bad datanode[0]
> 10.1.104.1:50010
> 2011-04-10 10:14:30,844 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_1213779416283711358_54194 in pipeline
> 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010: bad datanode
> 10.1.104.1:50010
> 2011-04-10 10:14:30,880 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> recovery attempt #2 from primary datanode 10.1.104.2:50010
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
> (=blk_1213779416283711358_54194) not found
>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:2099)
>        at org.apache.hadoop.hdfs.server.namenode.NameNode.commitBlockSynchronization(NameNode.java:703)
>        at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410)
>
>        at org.apache.hadoop.ipc.Client.call(Client.java:1104)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>        at $Proxy4.commitBlockSynchronization(Unknown Source)
>        at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1847)
>        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1828)
>        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1924)
>        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410)
>
>        at org.apache.hadoop.ipc.Client.call(Client.java:1104)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>        at $Proxy8.recoverBlock(Unknown Source)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2785)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2289)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2461)
> 2011-04-10 10:14:30,880 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_1213779416283711358_54194 failed  because
> recovery from primary datanode 10.1.104.2:50010 failed 3 times.
> Pipeline was 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010.
> Will retry...
> 2011-04-10 10:14:31,882 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_1213779416283711358_54194 bad datanode[0]
> 10.1.104.1:50010
> 2011-04-10 10:14:31,882 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_1213779416283711358_54194 in pipeline
> 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010: bad datanode
> 10.1.104.1:50010
> 2011-04-10 10:14:31,905 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> recovery attempt #3 from primary datanode 10.1.104.2:50010
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
> (=blk_1213779416283711358_54194) not found
>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:2099)
>        at org.apache.hadoop.hdfs.server.namenode.NameNode.commitBlockSynchronization(NameNode.java:703)
>        at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410)
>
>        at org.apache.hadoop.ipc.Client.call(Client.java:1104)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>        at $Proxy4.commitBlockSynchronization(Unknown Source)
>        at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1847)
>        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1828)
>        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1924)
>        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410)
>
>        at org.apache.hadoop.ipc.Client.call(Client.java:1104)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>        at $Proxy8.recoverBlock(Unknown Source)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2785)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2289)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2461)
> 2011-04-10 10:14:31,905 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_1213779416283711358_54194 failed  because
> recovery from primary datanode 10.1.104.2:50010 failed 4 times.
> Pipeline was 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010.
> Will retry...
>
> The Datanode log for this block ID shows:
> # cat  /var/log/hadoop/hadoop-hadoop-datanode-hadoop1-s02.log | grep
> 1213779416283711358
> 2011-04-10 07:29:23,858 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_1213779416283711358_54194 src: /10.1.104.5:40533 dest:
> /10.1.104.2:50010
> 2011-04-10 10:12:58,242 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> receiveBlock for block blk_1213779416283711358_54194
> java.io.EOFException: while trying to read 871 bytes
> 2011-04-10 10:12:58,242 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
> block blk_1213779416283711358_54194 Interrupted.
> 2011-04-10 10:12:58,242 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
> block blk_1213779416283711358_54194 terminating
> 2011-04-10 10:12:58,242 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> blk_1213779416283711358_54194 received exception java.io.EOFException:
> while trying to read 871 bytes
> 2011-04-10 10:12:58,288 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> oldblock=blk_1213779416283711358_54194(length=162696),
> newblock=blk_1213779416283711358_54249(length=162696),
> datanode=10.1.104.2:50010
> 2011-04-10 10:12:58,288 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
> blk_1213779416283711358_54249 of size 162696 as part of lease
> recovery.
> 2011-04-10 10:12:58,766 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> oldblock=blk_1213779416283711358_54249(length=162696),
> newblock=blk_1213779416283711358_54250(length=162696),
> datanode=10.1.104.2:50010
> 2011-04-10 10:12:58,803 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_1213779416283711358_54250 src: /10.1.104.5:33159 dest:
> /10.1.104.2:50010
> 2011-04-10 10:12:58,805 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Reopen Block for
> append blk_1213779416283711358_54250
> 2011-04-10 10:12:58,827 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: setBlockPosition
> trying to set position to 162696 for block
> blk_1213779416283711358_54250 which is not a multiple of
> bytesPerChecksum 512
> 2011-04-10 10:12:58,827 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> computePartialChunkCrc sizePartialChunk 392 block
> blk_1213779416283711358_54250 offset in block 162304 offset in
> metafile 1275
> 2011-04-10 10:12:58,827 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Read in partial CRC
> chunk from disk for block blk_1213779416283711358_54250
> 2011-04-10 10:12:58,828 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /10.1.104.5:33159, dest: /10.1.104.2:50010, bytes: 162696, op:
> HDFS_WRITE, cliID:
> DFSClient_hb_m_hadoop1-hbm1.farm-ny.gigya.com:60000_1301599306867,
> offset: 0, srvID: DS-1554836204-10.1.104.2-50010-1298986316425,
> blockid: blk_1213779416283711358_54250, duration: 21639000
> 2011-04-10 10:12:58,828 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
> block blk_1213779416283711358_54250 terminating
> 2011-04-10 10:13:31,080 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
> blk_1213779416283711358_54250 file
> /hadoop/dfs/data/current/subdir61/subdir17/blk_1213779416283711358 for
> deletion
> 2011-04-10 10:13:31,250 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
> blk_1213779416283711358_54250 at file
> /hadoop/dfs/data/current/subdir61/subdir17/blk_1213779416283711358
> 2011-04-10 10:14:29,811 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
> recoverBlock(block=blk_1213779416283711358_54194,
> targets=[10.1.104.5:50010, 10.1.104.2:50010])
> 2011-04-10 10:14:29,814 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
> found for block blk_1213779416283711358_54194 on datanode
> 10.1.104.5:50010
> 2011-04-10 10:14:29,814 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
> found for block blk_1213779416283711358_54194 on datanode
> 10.1.104.2:50010
> 2011-04-10 10:14:29,815 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 0 on 50020, call recoverBlock(blk_1213779416283711358_54194,
> false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@c754900) from
> 10.1.104.1:34481: error: org.apache.hadoop.ipc.RemoteException:
> java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
> (=blk_1213779416283711358_54194) not found
> 2011-04-10 10:14:29,828 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
> recoverBlock(block=blk_1213779416283711358_54194,
> targets=[10.1.104.5:50010, 10.1.104.2:50010])
> 2011-04-10 10:14:29,830 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
> found for block blk_1213779416283711358_54194 on datanode
> 10.1.104.5:50010
> 2011-04-10 10:14:29,830 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
> found for block blk_1213779416283711358_54194 on datanode
> 10.1.104.2:50010
> 2011-04-10 10:14:29,831 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 1 on 50020, call recoverBlock(blk_1213779416283711358_54194,
> false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@516485d9) from
> 10.1.104.1:34489: error: org.apache.hadoop.ipc.RemoteException:
> java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
> (=blk_1213779416283711358_54194) not found
> 2011-04-10 10:14:30,854 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
> recoverBlock(block=blk_1213779416283711358_54194,
> targets=[10.1.104.5:50010, 10.1.104.2:50010])
> 2011-04-10 10:14:30,856 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
> found for block blk_1213779416283711358_54194 on datanode
> 10.1.104.5:50010
> 2011-04-10 10:14:30,856 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
> found for block blk_1213779416283711358_54194 on datanode
> 10.1.104.2:50010
> 2011-04-10 10:14:30,857 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 2 on 50020, call recoverBlock(blk_1213779416283711358_54194,
> false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@62ee365a) from
> 10.1.104.1:34499: error: org.apache.hadoop.ipc.RemoteException:
> java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
> (=blk_1213779416283711358_54194) not found
> 2011-04-10 10:14:31,879 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
> recoverBlock(block=blk_1213779416283711358_54194,
> targets=[10.1.104.5:50010, 10.1.104.2:50010])
> 2011-04-10 10:14:31,882 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
> found for block blk_1213779416283711358_54194 on datanode
> 10.1.104.5:50010
> 2011-04-10 10:14:31,882 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
> found for block blk_1213779416283711358_54194 on datanode
> 10.1.104.2:50010
> 2011-04-10 10:14:31,883 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 0 on 50020, call recoverBlock(blk_1213779416283711358_54194,
> false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@ed5547f) from
> 10.1.104.1:34500: error: org.apache.hadoop.ipc.RemoteException:
> java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
> (=blk_1213779416283711358_54194) not found
> 2011-04-10 10:14:32,905 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
> recoverBlock(block=blk_1213779416283711358_54194,
> targets=[10.1.104.5:50010, 10.1.104.2:50010])
> 2011-04-10 10:14:32,907 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
> found for block blk_1213779416283711358_54194 on datanode
> 10.1.104.5:50010
> 2011-04-10 10:14:32,907 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
> found for block blk_1213779416283711358_54194 on datanode
> 10.1.104.2:50010
> 2011-04-10 10:14:32,908 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 1 on 50020, call recoverBlock(blk_1213779416283711358_54194,
> false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@2dfd9614) from
> 10.1.104.1:34511: error: org.apache.hadoop.ipc.RemoteException:
> java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
> (=blk_1213779416283711358_54194) not found
> 2011-04-10 10:14:33,935 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
> recoverBlock(block=blk_1213779416283711358_54194,
> targets=[10.1.104.5:50010, 10.1.104.2:50010])
> 2011-04-10 10:14:33,937 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
> found for block blk_1213779416283711358_54194 on datanode
> 10.1.104.5:50010
> 2011-04-10 10:14:33,937 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: No block metadata
> found for block blk_1213779416283711358_54194 on datanode
> 10.1.104.2:50010
> 2011-04-10 10:14:33,938 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 2 on 50020, call recoverBlock(blk_1213779416283711358_54194,
> false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@22fe47f3) from
> 10.1.104.1:34522: error: org.apache.hadoop.ipc.RemoteException:
> java.io.IOException: Block (=blk_1213779416283711358_54194) not found
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block
> (=blk_1213779416283711358_54194) not found
>
>
>
> Running "hadoop fsck /" says the filesystem is healthy.
>
> The same kind of errors happened on all the region servers, but for
> different blocks and against different datanodes; eventually the
> entire cluster shut itself down. This is the second time I've had
> this problem while running load tests on the system. The previous
> time, once it started happening it would recur every hour, even
> without any load running on the cluster, presumably during
> compaction. I wasn't sure whether it was somehow my fault, so I
> truncated the entire table, which stopped the crashes, but now it's
> happening again.
>
> Any idea what's going on?
>
> -eran
>
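When many region servers fail at once like this, a first triage step is often
just counting which blocks and how often the "Block ... not found" error
recurs across the logs. This is a minimal sketch (not from the thread itself)
that pulls block IDs out of log text in the same shape as the excerpts above;
the sample `LOG` lines here are abbreviated, hypothetical versions of the real
messages:

```python
import re
from collections import Counter

# Two sample lines shaped like the datanode log excerpts above
# (hypothetical abbreviations of the real messages).
LOG = """\
2011-04-10 10:14:29,815 INFO org.apache.hadoop.ipc.Server: error: java.io.IOException: Block (=blk_1213779416283711358_54194) not found
2011-04-10 10:14:30,857 INFO org.apache.hadoop.ipc.Server: error: java.io.IOException: Block (=blk_1213779416283711358_54194) not found
"""

# Matches the block ID inside "Block (=blk_<id>_<genstamp>) not found".
# Block IDs can be negative, hence the optional minus sign.
PATTERN = re.compile(r"Block \(=(blk_-?\d+_\d+)\) not found")

def failing_blocks(log_text):
    """Count 'Block ... not found' errors per block ID."""
    return Counter(PATTERN.findall(log_text))

counts = failing_blocks(LOG)
print(counts.most_common())
```

If one block dominates the counts (as blk_1213779416283711358 does above),
the failure is localized to a single write pipeline; if the counts are spread
across many blocks and datanodes, it points at a cluster-wide cause such as
the lease-recovery race the later replies in this thread discuss.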