Posted to hdfs-user@hadoop.apache.org by Takahiko Kawasaki <da...@gmail.com> on 2012/10/30 12:10:44 UTC

DataNodes fail to send heartbeat to HA-enabled NameNode

Hello,

I am having trouble with quorum-based HDFS HA on CDH 4.1.1.

The NameNode Web UI of Cloudera Manager reports the NameNode status.
It has a "Cluster Summary" section, and my cluster is summarized
there as shown below.

--- Cluster Summary ---
Configured Capacity   : 0 KB
DFS Used              : 0 KB
Non DFS Used          : 0 KB
DFS Remaining         : 0 KB
DFS Used%             : 100 %
DFS Remaining%        : 0 %
Block Pool Used       : 0 KB
Block Pool Used%      : 100 %
DataNodes usages      : Min %  Median %  Max %  stdev %
                          0 %       0 %    0 %      0 %
Live Nodes            : 0 (Decommissioned: 0)
Dead Nodes            : 5 (Decommissioned: 0)
Decommissioning Nodes : 0
--------------------

As you can see, all the DataNodes are regarded as dead.

I found that the DataNodes keep emitting logs about failing to
send heartbeats to the NameNode.

---- DataNode Log (host names were manually edited) ---
2012-10-30 19:28:16,817 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode
node02.example.com/192.168.62.232:8020 using DELETEREPORT_INTERVAL of
300000 msec  BLOCKREPORT_INTERVAL of 21600000msec Initial delay:
0msec; heartBeatInterval=3000
2012-10-30 19:28:16,817 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
BPOfferService for Block pool
BP-2063217961-192.168.62.231-1351263110470 (storage id
DS-2090122187-192.168.62.233-50010-1338981658216) service to
node02.example.com/192.168.62.232:8020
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:435)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:521)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:674)
        at java.lang.Thread.run(Thread.java:662)
--------------------

So I guess the DataNodes are failing to locate the name service
for some reason, but I have no clue how to solve the problem.

I confirmed that
/var/run/cloudera-scm-agent/process/???-hdfs-DATANODE/core-site.xml
on a DataNode contains

--- core-site.xml ---
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nameservice1</value>
  </property>
--------------------

and hdfs-site.xml contains

--- hdfs-site.xml ---
  <property>
    <name>dfs.nameservices</name>
    <value>nameservice1</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.nameservice1</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.nameservice1</name>
    <value>namenode38,namenode90</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.nameservice1.namenode38</name>
    <value>node01.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.nameservice1.namenode38</name>
    <value>node01.example.com:50070</value>
  </property>
  <property>
    <name>dfs.namenode.https-address.nameservice1.namenode38</name>
    <value>node01.example.com:50470</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.nameservice1.namenode90</name>
    <value>node02.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.nameservice1.namenode90</name>
    <value>node02.example.com:50070</value>
  </property>
  <property>
    <name>dfs.namenode.https-address.nameservice1.namenode90</name>
    <value>node02.example.com:50470</value>
  </property>
  <property>
    <name>dfs.permissions.superusergroup</name>
    <value>supergroup</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.replication.min</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.replication.max</name>
    <value>512</value>
  </property>
--------------------
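
(In case it helps spot a mistake, here is a sketch of commands that
could be run on a DataNode host to double-check what this
configuration resolves to. hdfs getconf and hdfs haadmin are the
stock CDH4 tools; I have not pasted their output here.)

--- config sanity check (sketch) ---
# hdfs getconf -confKey fs.defaultFS       # should print hdfs://nameservice1
# hdfs getconf -confKey dfs.nameservices   # should print nameservice1
# hdfs getconf -namenodes                  # should list node01 and node02
# sudo -u hdfs hdfs haadmin -getServiceState namenode38   # active or standby
# sudo -u hdfs hdfs haadmin -getServiceState namenode90
--------------------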

The following is my attempt to create a file in HDFS, which failed.

--------------------
# vi /tmp/test.txt
# sudo -u hdfs hadoop fs -mkdir /takahiko
# sudo -u hdfs hadoop fs -ls /
Found 3 items
drwxr-xr-x   - hbase hbase               0 2012-10-30 15:12 /hbase
drwxr-xr-x   - hdfs  supergroup          0 2012-10-30 18:55 /takahiko
drwxrwxrwt   - hdfs  hdfs                0 2012-10-26 23:58 /tmp
# sudo -u hdfs hadoop fs -copyFromLocal /tmp/test.txt /takahiko/
12/10/30 20:07:05 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/takahiko/test.txt._COPYING_ could only be replicated to 0 nodes
instead of minReplication (=1).  There are 0 datanode(s) running and
no node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1322)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2170)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:471)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)

        at org.apache.hadoop.ipc.Client.call(Client.java:1160)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
        at $Proxy9.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:290)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
        at $Proxy10.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1150)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1003)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:463)
copyFromLocal: File /takahiko/test.txt._COPYING_ could only be
replicated to 0 nodes instead of minReplication (=1).  There are 0
datanode(s) running and no node(s) are excluded in this operation.
12/10/30 20:07:05 ERROR hdfs.DFSClient: Failed to close file
/takahiko/test.txt._COPYING_
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/takahiko/test.txt._COPYING_ could only be replicated to 0 nodes
instead of minReplication (=1).  There are 0 datanode(s) running and
no node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1322)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2170)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:471)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)

        at org.apache.hadoop.ipc.Client.call(Client.java:1160)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
        at $Proxy9.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:290)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
        at $Proxy10.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1150)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1003)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:463)
--------------------
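
(For what it's worth, I would expect a command-line report to confirm
the same zero-live-DataNodes state that the Web UI shows; sketch only,
output not captured here.)

--- dfsadmin report (sketch) ---
# sudo -u hdfs hdfs dfsadmin -report    # summary should show 0 live DataNodes
--------------------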


Could anyone give me a hint on how to solve this problem?

Best Regards,
Takahiko Kawasaki

Re: DataNodes fail to send heartbeat to HA-enabled NameNode

Posted by Todd Lipcon <to...@cloudera.com>.
BTW, I forgot that I did file a ticket a while back on a related issue:
https://issues.apache.org/jira/browse/hdfs-2882

My assumption is that, higher up in the logs, you will find an underlying
issue that caused the NPEs later.
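
Something along these lines should surface the first WARN/ERROR before
the NPE loop starts (the log path below is only a guess at the usual
CDH layout -- point it at wherever your DN role actually logs):

grep -nE ' (WARN|ERROR) ' /var/log/hadoop-hdfs/*DATANODE*.log* | head -n 20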

-Todd

On Tue, Oct 30, 2012 at 11:23 AM, Todd Lipcon <to...@cloudera.com> wrote:

> Hi Takahiko,
>
> Can you please provide the full datanode log up to the point where you
> first see an NPE?
>
> FWIW, this error has nothing to do with the new QuorumJournalManager
> feature -- I've seen this bug once or twice over the last couple of years
> but have never been able to reproduce it reliably.
>
> -Todd
>
>
> On Tue, Oct 30, 2012 at 10:06 AM, Steve Loughran <st...@hortonworks.com> wrote:
>
>>
>>
>> On 30 October 2012 11:10, Takahiko Kawasaki <da...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I am having trouble with quorum-based HDFS HA on CDH 4.1.1.
>>>
>>> 2012-10-30 19:28:16,817 ERROR
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
>>> BPOfferService for Block pool
>>> BP-2063217961-192.168.62.231-1351263110470 (storage id
>>> DS-2090122187-192.168.62.233-50010-1338981658216) service to
>>> node02.example.com/192.168.62.232:8020
>>> java.lang.NullPointerException
>>>         at
>>> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:435)
>>>         at
>>> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:521)
>>>         at
>>> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:674)
>>>         at java.lang.Thread.run(Thread.java:662)
>>> --------------------
>>>
>>>
>> Looks like you've been the first person to find an issue in some code
>> that is very, very fresh.
>>
>> File a bug report on JIRA; try to replicate it on the latest Apache alpha
>> release if you can.
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: DataNodes fail to send heartbeat to HA-enabled NameNode

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Takahiko,

Can you please provide the full datanode log up to the point where you
first see an NPE?

FWIW, this error has nothing to do with the new QuorumJournalManager
feature -- I've seen this bug once or twice over the last couple of years
but have never been able to reproduce it reliably.

-Todd

On Tue, Oct 30, 2012 at 10:06 AM, Steve Loughran <st...@hortonworks.com> wrote:

>
>
> On 30 October 2012 11:10, Takahiko Kawasaki <da...@gmail.com> wrote:
>
>> Hello,
>>
>> I am having trouble with quorum-based HDFS HA on CDH 4.1.1.
>>
>> 2012-10-30 19:28:16,817 ERROR
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
>> BPOfferService for Block pool
>> BP-2063217961-192.168.62.231-1351263110470 (storage id
>> DS-2090122187-192.168.62.233-50010-1338981658216) service to
>> node02.example.com/192.168.62.232:8020
>> java.lang.NullPointerException
>>         at
>> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:435)
>>         at
>> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:521)
>>         at
>> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:674)
>>         at java.lang.Thread.run(Thread.java:662)
>> --------------------
>>
>>
> Looks like you've been the first person to find an issue in some code
> that is very, very fresh.
>
> File a bug report on JIRA; try to replicate it on the latest Apache alpha
> release if you can.
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: DataNodes fail to send heartbeat to HA-enabled NameNode

Posted by Steve Loughran <st...@hortonworks.com>.
On 30 October 2012 11:10, Takahiko Kawasaki <da...@gmail.com> wrote:

> Hello,
>
> I am having trouble with quorum-based HDFS HA on CDH 4.1.1.
>
> 2012-10-30 19:28:16,817 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> BPOfferService for Block pool
> BP-2063217961-192.168.62.231-1351263110470 (storage id
> DS-2090122187-192.168.62.233-50010-1338981658216) service to
> node02.example.com/192.168.62.232:8020
> java.lang.NullPointerException
>         at
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:435)
>         at
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:521)
>         at
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:674)
>         at java.lang.Thread.run(Thread.java:662)
> --------------------
>
>
Looks like you've been the first person to find an issue in some code
that is very, very fresh.

File a bug report on JIRA; try to replicate it on the latest Apache alpha
release if you can.

Re: DataNodes fail to send heartbeat to HA-enabled NameNode

Posted by Harsh J <ha...@cloudera.com>.
Moving to cdh-user@cloudera.org
(https://groups.google.com/a/cloudera.org/forum/#!forum/cdh-user), as
it may be a CDH4-specific problem.

Could you please share your whole DN log (from startup until the
heartbeat errors)? I suspect it's a problem with DN registration,
which the log will help confirm.
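
If it helps, the daemon prints a STARTUP_MSG banner at each boot, so
something like the following should slice out just the latest run (the
log path is an assumption -- substitute your DN role's actual log file):

tac /var/log/hadoop-hdfs/*DATANODE*.log.out | sed '/STARTUP_MSG: Starting DataNode/q' | tac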

On Tue, Oct 30, 2012 at 4:40 PM, Takahiko Kawasaki <da...@gmail.com> wrote:
> Hello,
>
> I am having trouble with quorum-based HDFS HA on CDH 4.1.1.
>
> The NameNode Web UI of Cloudera Manager reports the NameNode status.
> It has a "Cluster Summary" section, and my cluster is summarized
> there as shown below.
>
> --- Cluster Summary ---
> Configured Capacity   : 0 KB
> DFS Used              : 0 KB
> Non DFS Used          : 0 KB
> DFS Remaining         : 0 KB
> DFS Used%             : 100 %
> DFS Remaining%        : 0 %
> Block Pool Used       : 0 KB
> Block Pool Used%      : 100 %
> DataNodes usages      : Min %  Median %  Max %  stdev %
>                           0 %       0 %    0 %      0 %
> Live Nodes            : 0 (Decommissioned: 0)
> Dead Nodes            : 5 (Decommissioned: 0)
> Decommissioning Nodes : 0
> --------------------
>
> As you can see, all the DataNodes are regarded as dead.
>
> I found that the DataNodes keep emitting logs about failing to
> send heartbeats to the NameNode.
>
> ---- DataNode Log (host names were manually edited) ---
> 2012-10-30 19:28:16,817 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode
> node02.example.com/192.168.62.232:8020 using DELETEREPORT_INTERVAL of
> 300000 msec  BLOCKREPORT_INTERVAL of 21600000msec Initial delay:
> 0msec; heartBeatInterval=3000
> 2012-10-30 19:28:16,817 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> BPOfferService for Block pool
> BP-2063217961-192.168.62.231-1351263110470 (storage id
> DS-2090122187-192.168.62.233-50010-1338981658216) service to
> node02.example.com/192.168.62.232:8020
> java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:435)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:521)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:674)
>         at java.lang.Thread.run(Thread.java:662)
> --------------------



-- 
Harsh J

>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2170)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:471)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)
>
>         at org.apache.hadoop.ipc.Client.call(Client.java:1160)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>         at $Proxy9.addBlock(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:290)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>         at $Proxy10.addBlock(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1150)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1003)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:463)
> --------------------
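>
> The "There are 0 datanode(s) running" message matches the Cluster
> Summary above: the NameNode sees no live DataNodes at all, so no write
> can succeed regardless of the replication settings. The same view can
> be confirmed from the shell with the standard dfsadmin tool:
>
> --- NameNode's view of the cluster ---
> # sudo -u hdfs hdfs dfsadmin -report
> --------------------
>
> which should list every DataNode together with its live or dead state.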
>
>
> Could anyone give me any hint to solve the problem?
>
> Best Regards,
> Takahiko Kawasaki



-- 
Harsh J
