Posted to common-user@hadoop.apache.org by Ja Sam <pt...@gmail.com> on 2015/06/24 11:24:54 UTC

Hadoop doesn't work after restart

I had a running Hadoop cluster (version 2.2.0.2.0.6.0-76 from Hortonworks).
Yesterday a lot of things happened, and at some point we decided to reboot
all datanodes one by one. Unfortunately, the operator did not watch the
namenode health monitor while doing so.

The result of the above operation is that all datanodes show up as dead
nodes, all blocks are reported as lost, and so on.

We decided to reboot one of the datanodes once again to see whether it
would log anything interesting. Its log ends with:

INFO  ipc.Server (Server.java:run(861)) - IPC Server Responder: starting
INFO  ipc.Server (Server.java:run(688)) - IPC Server listener on 8010: starting

and hangs there. At the same time, on the namenode I can see only two types
of messages:

INFO  hdfs.StateChange (FSNamesystem.java:completeFile(2805)) - DIR*
completeFile: [SOME PATH] is closed by
DFSClient_NONMAPREDUCE_288661168_33

and a lot of:

WARN  blockmanagement.BlockManager
(PendingReplicationBlocks.java:pendingReplicationCheck(249)) -
PendingReplicationMonitor timed out blk_1074405820_668233

Today we decided to restart the namenode and all datanodes. After the
restart the website http://[server]:50070/dfshealth.jsp answers VERY slowly.
I don't see any errors in the logs except five like the one below:

 ERROR datanode.DataNode (DataXceiver.java:run(225)) -
maelhd21:50010:DataXceiver error processing WRITE_BLOCK operation
src: /node1:33470 dest: /node3:50010

org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block
BP-1037132819-192.168.61.196-1409328081083:blk_1075994366_2257020 already
exists in state FINALIZED and thus cannot be created.

3 out of 5 nodes show up as live, but refreshing the Hadoop status page
takes more than 10 minutes.

The question of course is: what should I check or do now?


P.S. I asked the same question on StackOverflow:
http://stackoverflow.com/questions/31020877/datanodes-are-cannot-connect-to-namenode

Re: Hadoop doesn't work after restart

Posted by hadoop hive <ha...@gmail.com>.
Try running fsck
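
For reference, a minimal sketch of how that could look (assuming the hdfs
client is on the PATH and the commands are run with HDFS superuser
privileges; adjust the path / to whatever part of the namespace you care
about):

    # Report on the whole namespace, with per-file block and location details
    hdfs fsck / -files -blocks -locations

    # List only the files that currently have missing or corrupt blocks
    hdfs fsck / -list-corruptfileblocks

    # Cross-check against the datanode report as the namenode sees it right now
    hdfs dfsadmin -report

fsck only reads metadata from the namenode (it changes nothing unless you
pass -move or -delete), so it is safe to run while you investigate. If the
datanodes simply have not re-registered and sent their block reports yet,
the "missing" blocks should reappear once they do.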

