Posted to hdfs-dev@hadoop.apache.org by Chris Nauroth <cn...@hortonworks.com> on 2015/09/09 18:25:11 UTC

Re: Any Error during BPOfferService run can lead to Missing DN.

Hello Nijel,

Thank you for reporting this.  I think it is invalid for a DataNode process in an HA deployment ever to be running with only one of its BPServiceActor threads alive.  As you observed, this can lead to a condition in which a DataNode is registered with one NameNode but not the other.  After an HA failover, it will appear that the DataNode has vanished from the cluster, and the only resolution is to restart that DataNode.  In extreme cases, if this happens for multiple DataNodes, then the cluster will appear to experience a massive loss of capacity, under-replicated blocks, and a storm of re-replication activity.

HDFS-2882 and HDFS-7714 demonstrated similar symptoms, but the root cause was different.  I encourage you to file a new HDFS JIRA with your findings, because it's a slightly different bug.  We can continue discussion of proposed solutions there.  At a high level, I think we'll need to explore either aborting the whole DataNode process or attempting to recover the failed BPServiceActor thread.  I'm not yet sure which approach is preferable.
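To make the two directions concrete, here is a rough standalone sketch (not the actual BPServiceActor code; every name in it, e.g. ActorLoopSketch, ABORT_ON_FAILURE, heartbeatOnce, is a placeholder rather than a Hadoop API):

    // Illustrative sketch of an actor loop with the two strategies under
    // discussion.  All names and values here are hypothetical.
    public class ActorLoopSketch implements Runnable {

      private static final boolean ABORT_ON_FAILURE = false; // option 1 vs. option 2
      private static final long RETRY_INTERVAL_MS = 60_000L; // wait before retrying

      private volatile boolean shouldRun = true;

      /** Stand-in for one heartbeat / command-processing cycle. */
      private void heartbeatOnce() throws Exception {
        // the real actor would send a heartbeat and process NameNode commands here
      }

      @Override
      public void run() {
        while (shouldRun) {
          try {
            heartbeatOnce();
          } catch (Throwable t) {
            if (ABORT_ON_FAILURE) {
              // Option 1: abort the whole DataNode process so the failure is
              // visible immediately instead of leaving a half-registered DN.
              System.err.println("Actor failed, aborting DataNode: " + t);
              Runtime.getRuntime().halt(1);
            } else {
              // Option 2: keep this thread alive and retry after a delay, so
              // the DN stays registered with this NameNode once the transient
              // condition clears.
              try {
                Thread.sleep(RETRY_INTERVAL_MS);
              } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                shouldRun = false;
              }
            }
          }
        }
      }
    }

Failing fast makes the problem loud but costs the whole DataNode; retrying keeps the DN registered with both NameNodes but can mask a chronic problem on the node.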

--Chris Nauroth

From: Nijel s f <ni...@huawei.com>
Reply-To: Hadoop Common <co...@hadoop.apache.org>
Date: Wednesday, September 9, 2015 at 1:40 AM
To: Hadoop Common <co...@hadoop.apache.org>, "hdfs-dev@hadoop.apache.org" <hd...@hadoop.apache.org>
Subject: FW: Any Error during BPOfferService run can lead to Missing DN.


+hdfs-dev


From: Nijel s f
Sent: 08 September 2015 19:17
To: Hadoop Common
Cc: 'Nijel S F'
Subject: Any Error during BPOfferService run can lead to Missing DN.

Hi all,

I got an issue from one of my sites.
The cluster is in HA mode and each DN has only one block pool.

The issue is that after an NN switchover, one DN is missing from the current active NN.
Upon analysis I found that there was an exception in BPOfferService.run():


2015-08-21 09:02:11,190 | WARN  | DataNode: [[[DISK]file:/srv/BigData/hadoop/data5/dn/ [DISK]file:/srv/BigData/hadoop/data4/dn/]]  heartbeating to 160-149-0-114/160.149.0.114:25000 | Unexpected exception in block pool Block pool BP-284203724-160.149.0.114-1438774011693 (Datanode Uuid 15ce1dd7-227f-4fd2-9682-091aa6bc2b89) service to 160-149-0-114/160.149.0.114:25000 | BPServiceActor.java:830
java.lang.OutOfMemoryError: unable to create new native thread
                at java.lang.Thread.start0(Native Method)
                at java.lang.Thread.start(Thread.java:714)
                at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
                at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1357)
                at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:172)
                at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:221)
                at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:1887)
                at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:669)
                at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:616)
                at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:856)
                at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:671)
                at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:822)
                at java.lang.Thread.run(Thread.java:745)

After this, that particular BPOfferService is down for the rest of the runtime,
and that particular NN will not have the details of this DN.
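
One detail worth noting: OutOfMemoryError is an Error, not an Exception, so a retry loop that only catches Exception will not catch it; it escapes to the outer handler that logs the "Unexpected exception in block pool" warning above, and the thread ends. A tiny generic illustration in plain Java (not the Hadoop code):

    // Generic demo: an Error escapes a catch (Exception) retry guard and
    // terminates the thread, similar to the OOME in the log above.
    public class ErrorEscapesCatch {
      public static void main(String[] args) throws InterruptedException {
        Thread actor = new Thread(() -> {
          while (true) {
            try {
              // stand-in for one heartbeat cycle that fails hard
              throw new OutOfMemoryError("unable to create new native thread");
            } catch (Exception e) {
              // never reached: OutOfMemoryError is an Error, not an Exception
              System.err.println("retrying after: " + e);
            }
          }
        });
        actor.start();
        actor.join();
        // prints alive=false: the uncaught Error has terminated the thread
        System.out.println("actor thread exited, alive=" + actor.isAlive());
      }
    }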

Similar issues are discussed in the following JIRAs
https://issues.apache.org/jira/browse/HDFS-2882
https://issues.apache.org/jira/browse/HDFS-7714

Can we retry in this case also, with a larger interval, instead of shutting down this BPOfferService?
Since these exceptions can occur randomly in the DN, I think it is not good to keep the DN running in a state where one NN does not have its info!
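
If we go with retry, one possible shape is a growing back-off with a cap, roughly like below (only a sketch of the idea; the names and values are made up, not an actual patch):

    // Hypothetical back-off helper: retry the failed cycle with a growing
    // interval instead of ending the BPOfferService.  Values are illustrative.
    public final class RetryBackoff {
      private static final long INITIAL_DELAY_MS = 5_000L;   // normal retry delay
      private static final long MAX_DELAY_MS     = 300_000L; // cap at 5 minutes

      private long currentDelayMs = INITIAL_DELAY_MS;

      /** Sleep for the current delay, then double it, up to the cap. */
      public void backOff() throws InterruptedException {
        Thread.sleep(currentDelayMs);
        currentDelayMs = Math.min(currentDelayMs * 2, MAX_DELAY_MS);
      }

      /** Call after a successful cycle to return to the normal delay. */
      public void reset() {
        currentDelayMs = INITIAL_DELAY_MS;
      }
    }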

Please give your suggestions.

________________________________


2012 Labs
Huawei Technologies India Pvt. Ltd.
Mobile: +91 9741149179
@nijelsf<https://twitter.com/nijelsf>


RE: Any Error during BPOfferService run can lead to Missing DN.

Posted by Nijel s f <ni...@huawei.com>.
Hi Chris,
Thanks for the comments.
I filed the JIRA https://issues.apache.org/jira/browse/HDFS-9046 and attached an initial patch to retry on failure.
We can discuss further in the JIRA.

Thanks
nijel
Huawei Technologies India Pvt. Ltd.
@nijelsf

