Posted to hdfs-dev@hadoop.apache.org by "Benoit Sigoure (JIRA)" <ji...@apache.org> on 2015/08/26 20:05:46 UTC

[jira] [Created] (HDFS-8960) DFS client says "no more good datanodes being available to try" on a single drive failure

Benoit Sigoure created HDFS-8960:
------------------------------------

             Summary: DFS client says "no more good datanodes being available to try" on a single drive failure
                 Key: HDFS-8960
                 URL: https://issues.apache.org/jira/browse/HDFS-8960
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: hdfs-client
    Affects Versions: 2.7.1
         Environment: openjdk version "1.8.0_45-internal"
OpenJDK Runtime Environment (build 1.8.0_45-internal-b14)
OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)
            Reporter: Benoit Sigoure


Since we upgraded to 2.7.1, we regularly see single-drive failures cause widespread problems at the HBase level (with the default 3x replication target).

Here's an example.  This HBase RegionServer is r12s16 (172.24.32.16) and is writing its WAL to [172.24.32.16:10110, 172.24.32.8:10110, 172.24.32.13:10110] as can be seen by the following occasional messages:

{code}
2015-08-23 06:28:40,272 INFO  [sync.3] wal.FSHLog: Slow sync cost: 123 ms, current pipeline: [172.24.32.16:10110, 172.24.32.8:10110, 172.24.32.13:10110]
{code}

A bit later, the second node in the pipeline above (172.24.32.8:10110) experiences an HDD failure.

{code}
2015-08-23 07:21:58,720 WARN  [DataStreamer for file /hbase/WALs/r12s16.sjc.aristanetworks.com,9104,1439917659071/r12s16.sjc.aristanetworks.com%2C9104%2C1439917659071.default.1440314434998 block BP-1466258523-172.24.32.1-1437768622582:blk_1073817519_77099] hdfs.DFSClient: Error Recovery for block BP-1466258523-172.24.32.1-1437768622582:blk_1073817519_77099 in pipeline 172.24.32.16:10110, 172.24.32.13:10110, 172.24.32.8:10110: bad datanode 172.24.32.8:10110
{code}

And then HBase goes "omg I can't write to my WAL, let me commit suicide":

{code}
2015-08-23 07:22:26,060 FATAL [regionserver/r12s16.sjc.aristanetworks.com/172.24.32.16:9104.append-pool1-t1] wal.FSHLog: Could not append. Requesting close of wal
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[172.24.32.16:10110, 172.24.32.13:10110], original=[172.24.32.16:10110, 172.24.32.13:10110]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:969)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1035)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1184)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:933)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:487)
{code}
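
For context, here is a rough sketch of how I understand the {{DEFAULT}} replacement policy to behave (the class and method names below are mine, not Hadoop's, and this is my reading rather than a verified copy of the 2.7.1 source): because the WAL is hflushed and replication is 3, a replacement datanode should always be requested when a node drops out, and the exception above means that the attempt to find one failed.

{code}
// Hedged sketch of my understanding of the DEFAULT
// replace-datanode-on-failure condition. Class/method names are
// hypothetical; only the decision logic is what I'm describing.
public final class ReplaceDatanodeSketch {
  static boolean shouldReplace(short replication, int liveNodesInPipeline,
                               boolean isAppend, boolean isHflushed) {
    if (replication < 3) {
      return false;                 // small replication: never replace
    }
    if (liveNodesInPipeline <= replication / 2) {
      return true;                  // lost half or more of the pipeline
    }
    return isAppend || isHflushed;  // appended/hflushed data must keep full replication
  }

  public static void main(String[] args) {
    // The WAL case from this report: replication 3, 2 nodes left, hflushed.
    System.out.println(shouldReplace((short) 3, 2, false, true));  // prints true
  }
}
{code}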

This should mostly be a non-event: the DFS client should just drop the bad replica from the write pipeline and carry on.

This is a small cluster, but it has 16 DNs, so the failed DN in the pipeline should be easy to replace.  I didn't set {{dfs.client.block.write.replace-datanode-on-failure.policy}} (so it's still {{DEFAULT}}) and I didn't set {{dfs.client.block.write.replace-datanode-on-failure.enable}} (so it's still {{true}}).
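
For reference, a minimal sketch of the client-side knobs involved is below (the class name is mine and the values shown are just the defaults we are effectively running with, not a proposed fix; the commented-out best-effort property is something I believe exists but haven't double-checked for 2.7.1):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Hypothetical example class; only the property names are the real ones.
public class WalClientConfSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The defaults we are running with (i.e. what this report describes):
    conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
    conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
    // Assumption: if this knob is available in 2.7.1, it would let the write
    // continue with fewer replicas instead of failing the stream.
    // conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", true);
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Default FS: " + fs.getUri());
  }
}
{code}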

I don't see anything noteworthy in the NN log around the time of the failure; it just seems like the DFS client gave up, or threw an exception back to HBase that it wasn't throwing before, or something else, and that made this single-drive failure lethal.

We've occasionally been "unlucky" enough to have a single-drive failure cause multiple RegionServers to commit suicide because they had their WALs on that drive.

We upgraded from 2.7.0 about a month ago, and I'm not sure whether we were seeing this with 2.7.0 or not; prior to that we were running in a quite different environment, but this is a fairly new deployment.


