Posted to common-user@hadoop.apache.org by Roberto Gonzalez <ro...@neclab.eu> on 2016/02/08 16:04:49 UTC

hadoop datanodes keep shutting down with SIGTERM 15

Hi all,

I'm running a Hadoop cluster with 24 servers. It had been running for some months, but since the last reboot the datanodes keep dying with this error:


2016-02-05 11:35:56,615 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40786, bytes: 118143861, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000330_0_-1595784897_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076219758_2486790, duration: 21719288540
2016-02-05 11:35:56,755 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40784, bytes: 118297616, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000231_0_-1089799971_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076221376_2488408, duration: 22149605332
2016-02-05 11:35:56,837 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40780, bytes: 118345914, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000208_0_-2005378882_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076231364_2498422, duration: 22460210591
2016-02-05 11:35:57,359 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40781, bytes: 118419792, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000184_0_406014429_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076221071_2488103, duration: 22978732747
2016-02-05 11:35:58,008 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40787, bytes: 118151696, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000324_0_-608122320_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076222362_2489394, duration: 23063230631
2016-02-05 11:36:00,295 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40776, bytes: 123206293, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000015_0_-846180274_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076244668_2511731, duration: 26044953281
2016-02-05 11:36:00,407 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40764, bytes: 123310419, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000010_0_-310980548_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076244751_2511814, duration: 26288883806
2016-02-05 11:36:01,371 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40783, bytes: 119653309, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000055_0_-558109635_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076222182_2489214, duration: 26808381782
2016-02-05 11:36:05,224 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: RECEIVED SIGNAL 15: SIGTERM
2016-02-05 11:36:05,230 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at computer75/192.168.0.133
************************************************************/


Every time I restart the cluster it comes up fine, with all the nodes on. But a few seconds into a MapReduce job, some nodes die with that error, and the set of dead nodes is different every time.

Do you have any idea of what is happening? I'm using Hadoop 2.4.1, and as I said, the cluster had been running for months without problems.


I cannot find any error in the logs before it receives the SIGTERM.
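
Is there any way to see who is sending the SIGTERM? What I'm planning to try next is something like this (just a sketch, assuming a standard Linux box with auditd available; the audit key name is arbitrary):

# the OOM killer sends SIGKILL rather than SIGTERM, but memory pressure
# around the crash time is still worth ruling out:
dmesg -T | grep -i -E 'oom|killed process'

# record every kill() syscall so the sender of the next SIGTERM shows up
# (64-bit rule; add a second rule with -F arch=b32 for 32-bit callers)
auditctl -a always,exit -F arch=b64 -S kill -k dn_sigterm
# after a datanode dies, look up the sending pid and command:
ausearch -k dn_sigterm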


Moreover, I tried Spark and it seems to work (I analyzed and saved about 100 GB without problems), and fsck reports that HDFS is healthy. Nevertheless, in a normal MapReduce job the maps start failing (not all of them; some finish correctly).
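
For what it's worth, the logs of the map attempts can be pulled like this (a sketch; it assumes YARN log aggregation is enabled, and the application id is the one from the client trace above):

# full aggregated logs for the job from the trace above
yarn logs -applicationId application_1454667838939_0001 | less

# narrow it down to one of the attempts seen in the trace
yarn logs -applicationId application_1454667838939_0001 \
    | grep -B 5 -A 20 attempt_1454667838939_0001_m_000330_0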


Any idea on how to solve it?



Thanks.

Re: hadoop datanodes keep shutting down with SIGTERM 15

Posted by Roberto Gonzalez <ro...@neclab.eu>.
Hi again.

I increased the log level to DEBUG and can now see a bit more.

The datanodes (and the NodeManagers) now die a few seconds after they start, even with no application running. The output is:

2016-02-09 13:15:02,417 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: block=BP-2025286576-192.168.0.93-1414492170010:blk_1074656937_917463, replica=FinalizedReplica, blk_1074656937_917463, FINALIZED
  getNumBytes()     = 9041675
  getBytesOnDisk()  = 9041675
  getVisibleLength()= 9041675
  getVolume()       = /data/1/datanode/current
  getBlockFile()    = /data/1/datanode/current/BP-2025286576-192.168.0.93-1414492170010/current/finalized/subdir13/subdir246/blk_1074656937
  unlinked          =false
2016-02-09 13:15:02,427 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: replica=FinalizedReplica, blk_1074656937_917463, FINALIZED
  getNumBytes()     = 9041675
  getBytesOnDisk()  = 9041675
  getVisibleLength()= 9041675
  getVolume()       = /data/1/datanode/current
  getBlockFile()    = /data/1/datanode/current/BP-2025286576-192.168.0.93-1414492170010/current/finalized/subdir13/subdir246/blk_1074656937
  unlinked          =false
2016-02-09 13:15:03,406 DEBUG org.apache.hadoop.ipc.Server: IPC Server idle connection scanner for port 50020: task running
2016-02-09 13:15:03,880 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Sending heartbeat with 3 storage reports from service actor: Block pool BP-2025286576-192.168.0.93-1414492170010 (Datanode Uuid 61f22860-7943-48e6-90ba-a47cea0672e3) service to computer61.ant-net/192.168.0.93:8020
2016-02-09 13:15:03,880 DEBUG org.apache.hadoop.ipc.Client: IPC Client (1842979695) connection to computer61.ant-net/192.168.0.93:8020 from hadoopuser sending #1037
2016-02-09 13:15:03,882 DEBUG org.apache.hadoop.ipc.Client: closing ipc connection to computer61.ant-net/192.168.0.93:8020: null
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1084)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:979)
2016-02-09 13:15:03,884 DEBUG org.apache.hadoop.ipc.Client: IPC Client (1842979695) connection to computer61.ant-net/192.168.0.93:8020 from hadoopuser: closed
2016-02-09 13:15:03,884 DEBUG org.apache.hadoop.ipc.Client: IPC Client (1842979695) connection to computer61.ant-net/192.168.0.93:8020 from hadoopuser: stopped, remaining connections 0
2016-02-09 13:15:03,888 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.io.EOFException: End of File Exception between local host is: "computer73.ant-net/192.168.0.131"; destination host is: "computer61.ant-net":8020; : java.io.EOFException; For more details see:  http://wiki.apache.org/hadoop/EOFException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:765)
    at org.apache.hadoop.ipc.Client.call(Client.java:1479)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy14.sendHeartbeat(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:153)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:554)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:653)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:824)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1084)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:979)
2016-02-09 13:15:06,880 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Sending heartbeat with 3 storage reports from service actor: Block pool BP-2025286576-192.168.0.93-1414492170010 (Datanode Uuid 61f22860-7943-48e6-90ba-a47cea0672e3) service to computer61.ant-net/192.168.0.93:8020
2016-02-09 13:15:06,880 DEBUG org.apache.hadoop.ipc.Client: The ping interval is 60000 ms.
2016-02-09 13:15:06,880 DEBUG org.apache.hadoop.ipc.Client: Connecting to computer61.ant-net/192.168.0.93:8020
2016-02-09 13:15:07,823 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: RECEIVED SIGNAL 15: SIGTERM
2016-02-09 13:15:07,826 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at computer73.ant-net/192.168.0.131
************************************************************/


I've tried both Java 8 and Java 7, compiling my own version of Hadoop, but it keeps failing. The connection to the namenode (computer61) seems to be failing, yet I can connect to that port with telnet.
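
Since the EOFException seems to mean the namenode closed the connection mid-RPC (that's what the wiki page in the trace says), and I'm running a self-compiled build, I first want to rule out a build mismatch between the datanodes and the namenode; roughly this (the log path is just a placeholder for my install):

# run on a datanode host and on computer61, and compare the output
hadoop version

# exercise an RPC path against the namenode from the failing host
hdfs dfsadmin -report | head

# check the namenode side of the same connection
tail -n 200 /path/to/hadoop/logs/hadoop-hadoopuser-namenode-computer61.log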

Do you have any idea of what could be happening, or anything else I could try?

Thanks!
Roberto