Posted to common-dev@hadoop.apache.org by "Hairong Kuang (JIRA)" <ji...@apache.org> on 2006/11/08 20:40:53 UTC

[jira] Commented: (HADOOP-692) Rack-aware Replica Placement

    [ http://issues.apache.org/jira/browse/HADOOP-692?page=comments#action_12448244 ] 
            
Hairong Kuang commented on HADOOP-692:
--------------------------------------

I plan to address the issue from three areas:

* Determine rack id
 
Each data node gets its rack id from the command line. The data node supports an option "-r <rack id>" or "--rack <rack id>". A rack id is a unique string identifying the rack that the data node belongs to. It usually consists of a name/ip plus the port number of the switch to which the data node directly connects. If the option is not set, the data node belongs to a default rack.

How a rack id is obtained is proprietary to each organization, so a mechanism needs to be provided when starting HDFS, for example, a script that prints the rack id to standard output. The output is fed to a data node when it starts.

The rack id is then stored in DatanodeID and sent to the name node as part of the registration information.
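The option handling above could be sketched as follows. This is a minimal sketch: the "-r"/"--rack" option names match the proposal, but RackOption, parseRackId, and the default rack string are illustrative, not actual Hadoop identifiers.

```java
// Sketch of data-node startup parsing a "-r <rack id>" / "--rack <rack id>"
// option. Names here are hypothetical, chosen only to illustrate the idea.
class RackOption {
    static final String DEFAULT_RACK = "/default-rack"; // assumed default rack id

    // Scan the command-line args for "-r <rack id>" or "--rack <rack id>";
    // if the option is absent, the node is assigned to the default rack.
    static String parseRackId(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if (args[i].equals("-r") || args[i].equals("--rack")) {
                return args[i + 1];
            }
        }
        return DEFAULT_RACK;
    }
}
```

A site-specific script's output (the rack id) would simply be substituted as the option value when the data node is launched.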

* Maintain rackid-to-datanodes map

A name node maintains a map from rack ids to data node descriptors, mapping each rack id to the list of data nodes that belong to that rack. When the name node receives a registration message from a data node, it first checks whether the map already has an entry for the data node; if so, it removes the old entry. It then adds a new entry. When the name node removes a data node, the data node's entry in the map is removed as well.

If all data nodes start without providing a rack id, the map contains a single default rack id mapping to all the data nodes in the system, so HDFS behaves the same as it does now.
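The registration bookkeeping described above might look roughly like this. This is a sketch only: RackMap and the string node ids stand in for the name node's real data node descriptors.

```java
import java.util.*;

// Sketch of the rackid-to-datanodes map kept by the name node. On
// registration, any stale entry for the node is dropped before the
// new one is added; removing a node also removes its map entry.
class RackMap {
    private final Map<String, List<String>> rackToNodes = new HashMap<>();
    private final Map<String, String> nodeToRack = new HashMap<>();

    // Handle a registration message: replace any old entry with the new one.
    void register(String nodeId, String rackId) {
        remove(nodeId); // drop the stale entry, if any
        rackToNodes.computeIfAbsent(rackId, r -> new ArrayList<>()).add(nodeId);
        nodeToRack.put(nodeId, rackId);
    }

    // Remove a data node and its map entry.
    void remove(String nodeId) {
        String oldRack = nodeToRack.remove(nodeId);
        if (oldRack != null) {
            List<String> nodes = rackToNodes.get(oldRack);
            nodes.remove(nodeId);
            if (nodes.isEmpty()) rackToNodes.remove(oldRack);
        }
    }

    List<String> nodesOnRack(String rackId) {
        return rackToNodes.getOrDefault(rackId, Collections.emptyList());
    }
}
```

With no rack ids supplied, every node registers under the single default rack id, matching today's behavior.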

* Place replicas

A simple policy is to place replicas across racks. This prevents losing data when an entire rack fails and allows reads of a file to use the bandwidth of multiple racks. The policy also distributes replicas evenly in the cluster, which makes it easy to balance load when a map/reduce job that takes the data as input is scheduled to run. The cost of the policy, however, is that a write has to transfer a block to multiple racks. One minor optimization is to place one replica on the local node where the writer is located or, if that is not possible, on a different node in the local rack.

For the most common case, when the replication factor is three, another possible placement policy is to place one replica on the local node, one on a different node in the local rack, and one on a different rack. This policy cuts the out-of-rack write traffic and hence improves write performance. Because the chance of a rack failure is far less than that of a node failure, it does not affect data reliability and availability much. It does reduce the aggregate network bandwidth available when reading a file, since a block is now placed on two racks rather than three. The replicas of a file are also not evenly distributed across the racks: one third of the replicas are on one node, two thirds are on one rack, and the remaining third is evenly distributed across the remaining racks. A map/reduce job still gets a chance to balance its load, because each block has one replica placed on a random rack. Overall I feel that this placement policy makes a good trade-off: it greatly improves write performance at little cost in data reliability or read performance.
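The three-replica policy could be sketched as below. This is illustrative only: PlacementSketch, Node, and chooseTargets are hypothetical names, and a real implementation would also weigh node load and free capacity when choosing among candidates.

```java
import java.util.*;

// Sketch of the three-replica policy: first replica on the writer's own
// node, second on a different node in the same rack, third on a remote rack.
class PlacementSketch {
    static class Node {
        final String name, rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
    }

    static List<Node> chooseTargets(Node writer, List<Node> cluster, Random rand) {
        List<Node> targets = new ArrayList<>();
        targets.add(writer); // replica 1: the writer's local node

        // Split the remaining nodes by whether they share the writer's rack.
        List<Node> sameRack = new ArrayList<>(), otherRacks = new ArrayList<>();
        for (Node n : cluster) {
            if (n == writer) continue;
            if (n.rack.equals(writer.rack)) sameRack.add(n); else otherRacks.add(n);
        }
        if (!sameRack.isEmpty())   // replica 2: different node, local rack
            targets.add(sameRack.get(rand.nextInt(sameRack.size())));
        if (!otherRacks.isEmpty()) // replica 3: a node on a different rack
            targets.add(otherRacks.get(rand.nextInt(otherRacks.size())));
        return targets;
    }
}
```

Only the third replica crosses a rack boundary, which is where the out-of-rack write traffic savings come from.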

> Rack-aware Replica Placement
> ----------------------------
>
>                 Key: HADOOP-692
>                 URL: http://issues.apache.org/jira/browse/HADOOP-692
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.8.0
>            Reporter: Hairong Kuang
>         Assigned To: Hairong Kuang
>             Fix For: 0.9.0
>
>
> This issue assumes that HDFS runs on a cluster of computers spread across many racks. Communication between two nodes on different racks has to go through switches, and the bandwidth in/out of a rack may be less than the total bandwidth of the machines in the rack. The purpose of rack-aware replica placement is to improve data reliability, availability, and network bandwidth utilization. The basic idea is that each data node determines which rack it belongs to at startup and notifies the name node of its rack id upon registration. The name node maintains a rackid-to-datanode map and tries to place replicas across racks.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira