Posted to common-user@hadoop.apache.org by Keith Wyss <wy...@gmail.com> on 2013/04/17 22:45:00 UTC

How to fix Under-replication

Hi. I sent this a few minutes ago, but I had not confirmed my subscription
to the mailing list, so I don't think it went through. If it did, I
apologize for the re-post.

-----------------------------

Hello there.

I am operating a cluster that is consistently unable to create three
replicas for a large percentage of blocks.

I think I have a good idea of why this is the case, but I would like
suggestions about how to fix it.

Let me begin with the namenode logs.

There are many occurrences of this warning:
WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place
enough replicas, still in need of 1
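
For scale, the under-replication also shows up in the fsck summary:

  hadoop fsck /

which prints a line of the form "Under-replicated blocks: <count>
(<percent> %)".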

The cluster is only just over 50% full and has well over 3 nodes. This,
together with the absence of other widespread problems, rules out the
possibility that there is simply no room for the blocks.
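
For per-node numbers, the standard report:

  hadoop dfsadmin -report

lists Configured Capacity, DFS Used%, and DFS Remaining for each datanode;
the full machines on the smaller racks stand out there with almost no DFS
Remaining.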

This leaves the possibility that the namenode is unable to satisfy the
block placement policy, which I believe is what is happening.

I read in
http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera
that if there are more than 2 racks, then a block must be present on at
least two racks.
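
As I understand it, the default policy for a replication factor of 3
places the first replica on the writer's own datanode, the second on a
node in a different rack, and the third on a different node in the same
rack as the second. Here is a minimal Python sketch of that rule. It is
heavily simplified and is not the actual HDFS code: the real
BlockPlacementPolicyDefault also weighs load and free space and has
fallback paths, and `topology`, `has_space`, and the node names below are
made up:

import random

def choose_targets(topology, has_space, writer_node, writer_rack):
    """Pick targets for 3 replicas under the default rule.
    topology:  dict of rack name -> list of datanode names
    has_space: callable returning True if a node can take a replica"""
    targets = [writer_node]  # 1st replica: the writing node itself

    # 2nd replica: any node with space on a *different* rack.
    remote = [(rack, node)
              for rack, nodes in topology.items() if rack != writer_rack
              for node in nodes if has_space(node)]
    if not remote:
        # Nowhere to go off-rack: this is roughly where the namenode
        # warns "Not able to place enough replicas ..."
        return targets
    rack2, second = random.choice(remote)
    targets.append(second)

    # 3rd replica: a different node on the *same* rack as the 2nd.
    third_candidates = [n for n in topology[rack2]
                        if n != second and has_space(n)]
    if third_candidates:
        targets.append(random.choice(third_candidates))
    return targets

# Example: if every small-rack node is full, no off-rack candidate
# exists and the block comes up short (all names are hypothetical).
full = {"node01.example.com"}
racks = {"/dc1/rack1": ["node01.example.com"],
         "/haas/rack0": ["haas01", "haas02", "haas03"]}
print(choose_targets(racks, lambda n: n not in full,
                     "haas01", "/haas/rack0"))   # -> ['haas01']

When every off-rack node is out of space, one of those choices has no
candidates and the block stays short a replica, which lines up with the
warning above.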

This makes sense, but our network situation is a little bizarre. It
consists of:
- a small number of machines with a dedicated datacenter/rack/host
configuration
-- These are spread across a few racks.
- a large number of machines provisioned through an internal
hardware-as-a-service provider.
-- These are all listed as a single rack.

The details of the rack allocation for the machines provisioned from the
hardware-as-a-service provider are abstracted away and not available to
us. The link to those machines has plenty of bandwidth, so this is not as
crazy as it sounds.
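
For concreteness, our topology script looks roughly like this (the
hostnames and rack labels here are invented stand-ins, not our real
ones); Hadoop invokes the script with hostnames or IPs as arguments and
reads back one rack path per line:

#!/usr/bin/env python
import sys

# The handful of dedicated machines have real datacenter/rack
# assignments.
DEDICATED = {
    "node01.example.com": "/dc1/rack1",
    "node02.example.com": "/dc1/rack2",
}

for host in sys.argv[1:]:
    # Every machine from the hardware-as-a-service provider falls
    # through to a single catch-all rack, since its real rack is
    # not exposed to us.
    print(DEDICATED.get(host, "/haas/rack0"))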

Our problem is that the machines on all the smaller racks have now filled
all of the space that dfs.datanode.du.reserved leaves available to HDFS.
As a result, every block written since those machines ran out of space is
missing one replica.
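
For reference, the setting in question is the usual hdfs-site.xml
property; the 10 GB value below is only an example, not what we actually
reserve:

<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- bytes per volume reserved for non-HDFS use; 10 GB is illustrative -->
  <value>10737418240</value>
</property>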

Is there a way to configure Hadoop to create a third replica anyway
(aside from changing the hadoop.topology.script implementation)?

What can I do to either confirm or deny my suspicions?

Thanks for your help,
Keith