Posted to common-user@hadoop.apache.org by Marc Sturlese <ma...@gmail.com> on 2013/08/22 09:41:52 UTC

rack awareness unexpected behaviour

Hey there,
I've set up rack awareness on my hadoop cluster with replication 3. I have 2
racks and each contains 50% of the nodes.
I can see that the blocks are spread on the 2 racks, the problem is that all
nodes from a rack are storing 2 replicas and the nodes of the other rack
just one. If I launch the hadoop balancer script, it will properly spread
the replicas across the 2 racks, leaving all nodes with exactly the same
available disk space but, after jobs are running for hours, the data will be
unbalanced again (rack1 having all nodes with less empty disk space than all
nodes from rack2)
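For reference, my topology wiring follows the standard script-based mapping;
a minimal sketch (hostnames made up, not my real ones) looks like:

  #!/bin/bash
  # topology.sh - referenced from core-site.xml via
  #   topology.script.file.name=/etc/hadoop/conf/topology.sh
  # Hadoop passes one or more hostnames/IPs as arguments and expects
  # one rack path per argument on stdout.
  for host in "$@"; do
    case "$host" in
      node0[1-5]*)         echo -n "/rack1 " ;;   # first half of the nodes
      node0[6-9]*|node10*) echo -n "/rack2 " ;;   # second half
      *)                   echo -n "/default-rack " ;;
    esac
  done
  echo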

Any clue what's going on?
Thanks in advance




Re: rack awareness unexpected behaviour

Posted by Jun Ping Du <jd...@vmware.com>.
The current HDFS default replica placement policy doesn't fit the case of two unbalanced racks very well: assume the local rack has more nodes, which means more reducer slots and more disk capacity; then more reduce tasks will be executed within the local rack. According to the replica placement policy, HDFS will put 1 replica on the local rack and 2 replicas on the remote rack, which means the data load is doubled on the remote rack even though there is less capacity there.
The workaround of cheating the rack-awareness script (as described below) may help to resolve the unbalanced-data problem, but it comes with the following two issues:
1. Data reliability - all 3 replicas of some blocks may fall into the same "real" rack.
2. Rack-level data locality - both task scheduling and replica selection during HDFS reads will misunderstand the real rack topology.
See if this is a tradeoff you want to make in your case.
Another workaround, although not designed for this case, may be helpful: enable the "NodeGroup" level of locality, a layer between node and rack that is supported after 1.2.0. Nodes under the same "NodeGroup" can have only one replica placed on them, which was designed to avoid duplicated replicas on VMs sharing the same physical host. Specifically in your case, assuming you have 20 machines in rack A and 10 machines in rack B, you can put the rack A nodes into two NodeGroups (so each NodeGroup has 10 nodes) and the rack B nodes into one NodeGroup. In this case, the replicas will be distributed in a ratio of 2:1, no matter where the writer is. Hope it helps.
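A rough sketch of the wiring (hostnames made up; the property names are the branch-1.2 ones from the HVE work, so please double-check them against your distribution):

  #!/bin/bash
  # Topology script emitting /rack/nodegroup paths (3-layer topology).
  # Assumed companion settings for Hadoop 1.2+ with NodeGroup support:
  #   net.topology.impl=org.apache.hadoop.net.NetworkTopologyWithNodeGroup
  #   net.topology.nodegroup.aware=true
  #   dfs.block.replicator.classname=
  #     org.apache.hadoop.hdfs.server.namenode.BlockPlacementPolicyWithNodeGroup
  for host in "$@"; do
    case "$host" in
      rackA-node0[1-9]|rackA-node10) echo -n "/rackA/nodegroup1 " ;;
      rackA-node1[1-9]|rackA-node20) echo -n "/rackA/nodegroup2 " ;;
      rackB-node*)                   echo -n "/rackB/nodegroup1 " ;;
      *)                             echo -n "/default-rack/default-nodegroup " ;;
    esac
  done
  echo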


Thanks, 

Junping

----- Original Message -----
From: "Michael Segel" <mi...@hotmail.com>
To: common-user@hadoop.apache.org
Cc: hadoop-user@lucene.apache.org
Sent: Thursday, October 3, 2013 8:23:58 PM
Subject: Re: rack awareness unexpected behaviour

Marc, 

The rack awareness script is an artificial concept, meaning you can tell Hadoop which machine is in which rack, and that may or may not reflect where the machine is actually located.
The idea is to balance the number of nodes in the racks, at least on paper. So you can have 14 machines in rack 1 and 16 machines in rack 2, even though there may physically be 20 machines in rack 1 and 10 machines in rack 2.

HTH

-Mike

On Oct 3, 2013, at 2:52 AM, Marc Sturlese <ma...@gmail.com> wrote:

> I've checked it out and it works like that. The problem is, if the two racks
> don't have the same capacity, one will have its disk space filled up much
> faster than the other (that's what I'm seeing).
> If one rack (rack A) has 2 servers of 8 cores with 4 reduce slots each and
> the other rack (rack B) has 2 servers of 16 cores with 8 reduce slots each,
> rack A will get filled up faster as rack B is writing more (because it has
> more reduce slots).
> 
> Could a solution be to modify the bash script used to decide where a
> block's replicas get written? It would use probability and give rack B
> double the chance of receiving the write.

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com

Re: rack awareness unexpected behaviour

Posted by Michel Segel <mi...@hotmail.com>.
And that's the rub.
Rack awareness is an artificial construct.

If you want to fix it and match the real world, you need to balance the racks physically.
Otherwise you need to rewrite the load balancing to take into consideration the number and power of the nodes in each rack.

The short answer: it's easier to fudge the values in the script.

Sent from a remote device. Please excuse any typos...

Mike Segel

> On Oct 3, 2013, at 8:58 AM, Marc Sturlese <ma...@gmail.com> wrote:
> 
> Doing that will balance the block writing, but I think you then lose the
> concept of physical rack awareness.
> Let's say you have 2 physical racks, one with 2 servers and one with 4. If
> you artificially tell Hadoop that each rack has 3 servers, you are losing
> the concept of rack awareness. You're not guaranteeing that each physical
> rack contains at least one replica of each block.
> 
> So if you have 2 racks with a different number of servers, it's not
> possible to do proper rack awareness without first filling up the disks of
> the rack with fewer servers. Am I right or am I missing something?

Re: rack awareness unexpected behaviour

Posted by Marc Sturlese <ma...@gmail.com>.
Doing that will balance the block writing, but I think you then lose the
concept of physical rack awareness.
Let's say you have 2 physical racks, one with 2 servers and one with 4. If
you artificially tell Hadoop that each rack has 3 servers, you are losing
the concept of rack awareness. You're not guaranteeing that each physical
rack contains at least one replica of each block.

So if you have 2 racks with a different number of servers, it's not possible
to do proper rack awareness without first filling up the disks of the rack
with fewer servers. Am I right or am I missing something?




Re: rack awareness unexpected behaviour

Posted by Michael Segel <mi...@hotmail.com>.
Marc, 

The rack awareness script is an artificial concept, meaning you can tell Hadoop which machine is in which rack, and that may or may not reflect where the machine is actually located.
The idea is to balance the number of nodes in the racks, at least on paper. So you can have 14 machines in rack 1 and 16 machines in rack 2, even though there may physically be 20 machines in rack 1 and 10 machines in rack 2.
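As a sketch of what I mean (hostnames hypothetical), that "on paper" balance is just what the topology script hands back:

  #!/bin/bash
  # Physically: 20 machines in rack 1 (r1-node01..20), 10 in rack 2.
  # On paper we report 14 and 16 by relabeling six rack-1 hosts.
  for host in "$@"; do
    case "$host" in
      r1-node0[1-9]|r1-node1[0-4]) echo -n "/rack1 " ;;   # 14 stay put
      r1-node1[5-9]|r1-node20)     echo -n "/rack2 " ;;   # 6 relabeled
      r2-node*)                    echo -n "/rack2 " ;;   # the real 10
      *)                           echo -n "/default-rack " ;;
    esac
  done
  echo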

HTH

-Mike

On Oct 3, 2013, at 2:52 AM, Marc Sturlese <ma...@gmail.com> wrote:

> I've checked it out and it works like that. The problem is, if the two racks
> don't have the same capacity, one will have its disk space filled up much
> faster than the other (that's what I'm seeing).
> If one rack (rack A) has 2 servers of 8 cores with 4 reduce slots each and
> the other rack (rack B) has 2 servers of 16 cores with 8 reduce slots each,
> rack A will get filled up faster as rack B is writing more (because it has
> more reduce slots).
> 
> Could a solution be to modify the bash script used to decide where a
> block's replicas get written? It would use probability and give rack B
> double the chance of receiving the write.

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com






Re: rack awareness unexpected behaviour

Posted by Marc Sturlese <ma...@gmail.com>.
I've checked it out and it works like that. The problem is, if the two racks
don't have the same capacity, one will have its disk space filled up much
faster than the other (that's what I'm seeing).
If one rack (rack A) has 2 servers of 8 cores with 4 reduce slots each and
the other rack (rack B) has 2 servers of 16 cores with 8 reduce slots each,
rack A will get filled up faster as rack B is writing more (because it has
more reduce slots).

Could a solution be to modify the bash script used to decide where a
block's replicas get written? It would use probability and give rack B
double the chance of receiving the write.





Re: rack awareness unexpected behaviour

Posted by Jun Ping Du <jd...@vmware.com>.
For 3 replicas, the placement sequence is: the 1st on the writer's local node, the 2nd on a node in a rack remote from the 1st replica, and the 3rd on another node in the same rack as the 2nd replica.
There can be special cases, such as the disk being full on the 1st node or no node being available in the 2nd replica's rack, and Hadoop already takes care of these well. I agree with Harsh: you should first check whether tasks are evenly distributed across the two racks.
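You can also verify the actual per-block placement the namenode sees with fsck; for example (path and addresses are made up):

  # Print every block with the rack of each replica:
  hadoop fsck /path/to/data -files -blocks -racks
  # A healthy block line looks roughly like:
  #   blk_... len=... repl=3 [/rack1/10.0.1.12:50010,
  #                           /rack2/10.0.2.7:50010, /rack2/10.0.2.9:50010]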

Thanks,

Junping

----- Original Message -----
From: "Michel Segel" <mi...@hotmail.com>
To: common-user@hadoop.apache.org
Cc: common-user@hadoop.apache.org, hadoop-user@lucene.apache.org
Sent: Thursday, August 22, 2013 6:57:15 PM
Subject: Re: rack awareness unexpected behaviour

Rack awareness is an artificial concept,
meaning you can define where a node is regardless of its real position in the rack.

Going from memory, and it's probably been changed in later versions of the code...

Isn't the replication... copy on node 1, copy on the same rack, third copy on a different rack?

Or has this been improved upon?

Sent from a remote device. Please excuse any typos...

Mike Segel

On Aug 22, 2013, at 5:14 AM, Harsh J <ha...@cloudera.com> wrote:

> I'm not aware of a bug in 0.20.2 that would not honor the rack
> awareness, but have you done the two checks below as well?
> 
> 1. Ensuring JT has the same rack awareness scripts and configuration
> so it can use it for scheduling, and,
> 2. Checking if the map and reduce tasks are being evenly spread across
> both racks.
> 
> On Thu, Aug 22, 2013 at 2:50 PM, Marc Sturlese <ma...@gmail.com> wrote:
>> I'm on cdh3u4 (0.20.2), gonna try to read a bit on this bug
> 
> 
> 
> -- 
> Harsh J
> 

Re: rack awareness unexpected behaviour

Posted by Michel Segel <mi...@hotmail.com>.
Rack awareness is an artificial concept,
meaning you can define where a node is regardless of its real position in the rack.

Going from memory, and it's probably been changed in later versions of the code...

Isn't the replication... copy on node 1, copy on the same rack, third copy on a different rack?

Or has this been improved upon?

Sent from a remote device. Please excuse any typos...

Mike Segel

On Aug 22, 2013, at 5:14 AM, Harsh J <ha...@cloudera.com> wrote:

> I'm not aware of a bug in 0.20.2 that would not honor the rack
> awareness, but have you done the two checks below as well?
> 
> 1. Ensuring JT has the same rack awareness scripts and configuration
> so it can use it for scheduling, and,
> 2. Checking if the map and reduce tasks are being evenly spread across
> both racks.
> 
> On Thu, Aug 22, 2013 at 2:50 PM, Marc Sturlese <ma...@gmail.com> wrote:
>> I'm on cdh3u4 (0.20.2), gonna try to read a bit on this bug
> 
> 
> 
> -- 
> Harsh J
> 

Re: rack awareness unexpected behaviour

Posted by Harsh J <ha...@cloudera.com>.
I'm not aware of a bug in 0.20.2 that would not honor the rack
awareness, but have you done the two checks below as well?

1. Ensuring JT has the same rack awareness scripts and configuration
so it can use it for scheduling, and,
2. Checking if the map and reduce tasks are being evenly spread across
both racks.
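
For (1), a quick sanity check is to run the configured script by hand on the
JT host (script path and hostnames below are examples):

  # Should print one rack per argument, matching what the NN reports:
  /etc/hadoop/conf/topology.sh dn1.example.com dn2.example.com
  # expected output for a two-rack mapping:
  #   /rack1 /rack2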

On Thu, Aug 22, 2013 at 2:50 PM, Marc Sturlese <ma...@gmail.com> wrote:
> I'm on cdh3u4 (0.20.2), gonna try to read a bit on this bug



-- 
Harsh J

Re: rack awareness unexpected behaviour

Posted by Marc Sturlese <ma...@gmail.com>.
I'm on cdh3u4 (0.20.2), gonna try to read a bit on this bug




Re: rack awareness unexpected behaviour

Posted by Nicolas Liochon <nk...@gmail.com>.
When you rebalance, the blocks are already fully written, so writer locality
does not have to be taken into account (there is no writer anymore), hence the
balancer can rebalance across the racks. That's why job asymmetry was the easy
guess. What's your Hadoop version, by the way? I remember a bug around rack
awareness, but it was fixed a year ago (and I'm not sure it would have had
this effect).
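
For reference, the balancer can also be told how even is even enough via its
threshold, a percentage of the cluster-average utilization (10 by default):

  # Move blocks until every datanode is within 5% of the average:
  hadoop balancer -threshold 5
  # or equivalently on most installs:
  #   $HADOOP_HOME/bin/start-balancer.sh -threshold 5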


On Thu, Aug 22, 2013 at 10:29 AM, Marc Sturlese <ma...@gmail.com> wrote:

> Jobs run on the whole cluster. After rebalancing, everything is properly
> allocated. Then I start running jobs using all the slots of the 2 racks and
> the problem starts to happen.
> Maybe I'm missing something. When using rack awareness, do you have to
> tell the jobs to run in slots from both racks and not just one? (I guess
> not)
>
>
>

Re: rack awareness unexpected behaviour

Posted by Marc Sturlese <ma...@gmail.com>.
Jobs run on the whole cluster. After rebalancing, everything is properly
allocated. Then I start running jobs using all the slots of the 2 racks and
the problem starts to happen.
Maybe I'm missing something. When using rack awareness, do you have to
tell the jobs to run in slots from both racks and not just one? (I guess
not)




Re: rack awareness unexpected behaviour

Posted by Nicolas Liochon <nk...@gmail.com>.
Do the jobs run on the whole cluster or on a single rack?
If you write from a single rack, you will get something similar to what you
described, because the default policy is to put one replica locally and 2
replicas on the same remote rack. It does check that there is enough space
available, but it does not try to balance.
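
As a worked example (hypothetical hosts):

  # Default placement for one block, writer on rack1:
  #   replica 1 -> rack1/nodeA   (the writer's own node)
  #   replica 2 -> rack2/nodeX   (a node on a remote rack)
  #   replica 3 -> rack2/nodeY   (same rack as replica 2)
  # So the rack that does less of the writing receives the two remote
  # replicas of each block and fills up faster.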


On Thu, Aug 22, 2013 at 9:41 AM, Marc Sturlese <ma...@gmail.com> wrote:

> Hey there,
> I've set up rack awareness on my Hadoop cluster with replication 3. I have
> 2 racks and each contains 50% of the nodes.
> I can see that the blocks are spread across the 2 racks; the problem is
> that all nodes from one rack are storing 2 replicas and the nodes of the
> other rack just one. If I launch the Hadoop balancer script, it will
> properly spread the replicas across the 2 racks, leaving all nodes with
> exactly the same available disk space, but after jobs have been running
> for hours, the data becomes unbalanced again (rack1 having all nodes with
> less empty disk space than all nodes from rack2)
>
> Any clue what's going on?
> Thanks in advance