Posted to hdfs-user@hadoop.apache.org by David Ginzburg <gi...@hotmail.com> on 2011/01/20 09:42:17 UTC

Adding new data nodes to an existing cluster with different storage capacity

Hi,
Our current cluster runs 22 data nodes, each with 4 TB of storage.
We will be installing new data nodes on this existing cluster, but each of them will have 8 TB of storage capacity.
I am wondering how the namenode will distribute the blocks. My understanding is that the replica placement policy chooses data nodes at random, so an even distribution of blocks is expected. Eventually the smaller nodes will fill up while the larger nodes reach only about 50%, at which point the small nodes will become unusable.
Am I correct?
Is there any recommended practice in this case? Would running a balancer periodically help?
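For reference, per-node usage can be checked from the command line; a minimal sketch, assuming a 2011-era (0.20.x) Hadoop CLI:

    # Prints configured capacity, DFS used, and DFS used% for each datanode
    hadoop dfsadmin -report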

RE: Adding new data nodes to an existing cluster with different storage capacity

Posted by David Ginzburg <gi...@hotmail.com>.
Thank you

Date: Thu, 20 Jan 2011 09:26:36 -0800
From: ayonsinha@yahoo.com
Subject: Re: Adding new data nodes to an existing cluster with different storage capacity
To: hdfs-user@hadoop.apache.org




Re: Adding new data nodes to an existing cluster with different storage capacity

Posted by Ayon Sinha <ay...@yahoo.com>.
We did the same exercise a few months back. When we ran the balancer (which takes a while), it balanced based on the percentage of disk usage on each node, so you end up with all nodes somewhere around 45-55% usage.
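For example, the balancer accepts a utilization threshold; a minimal sketch, assuming the same-era CLI, where the threshold is the allowed deviation, in percent, of each datanode from the cluster-wide mean usage:

    # Move blocks until every datanode is within 5% of average utilization
    hadoop balancer -threshold 5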
Sometimes the balancer does not balance well initially; in that case we increased the replication factor to 4 and kept it that way for a few days while running the balancer. Then we brought the replication factor back down to 3 and let the balancer run.
 -Ayon
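The temporary replication change can be made with setrep; a minimal sketch, assuming replication is raised across the whole namespace (narrow the path to the data you actually want spread):

    # Temporarily raise the replication factor to 4, recursively from /
    hadoop fs -setrep -R 4 /
    # ...let the balancer run for a few days, then drop back to 3
    hadoop fs -setrep -R 3 /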



