Posted to common-dev@hadoop.apache.org by "Hairong Kuang (JIRA)" <ji...@apache.org> on 2007/07/25 20:47:31 UTC

[jira] Created: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Rebalance data blocks when new data nodes added or data nodes become full
-------------------------------------------------------------------------

Key: HADOOP-1652
URL: https://issues.apache.org/jira/browse/HADOOP-1652
Project: Hadoop
Issue Type: New Feature
Components: dfs
Affects Versions: 0.13.0
Reporter: Hairong Kuang
Assignee: Hairong Kuang
Fix For: 0.15.0

When a new data node joins an HDFS cluster, it does not hold much data. Any map task assigned to that machine therefore most likely does not read local data, which increases network bandwidth use. On the other hand, when some data nodes become full, new data blocks are placed only on non-full data nodes, which reduces their read parallelism.

This jira aims to find an approach to redistribute data blocks when imbalance occurs in the cluster. A solution should meet the following requirements:
1. It maintains data availability guarantees, in the sense that rebalancing does not reduce the number of replicas that a block has or the number of racks on which the block resides.
2. An administrator should be able to invoke and interrupt rebalancing from the command line.
3. Rebalancing should be throttled so that it does not make the namenode too busy to serve incoming requests or saturate the network.
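
Requirement 1 can be read as a check applied to each candidate block move. The following is only a minimal sketch of that reading, not code from any patch on this issue: the function name and the representation of replicas as a list of rack names are assumptions made for illustration. A move adds one copy and removes one, so the replica count is unchanged by construction and only the rack count needs checking.

```python
from collections import Counter


def move_preserves_availability(replica_racks, source_rack, target_rack):
    """Return True if moving one replica from source_rack to target_rack
    does not reduce the number of distinct racks holding the block.

    replica_racks: one rack name per current replica (hypothetical model).
    """
    if source_rack not in replica_racks:
        raise ValueError("source rack holds no replica of this block")
    racks_before = set(replica_racks)
    # Remove exactly one replica from the source rack.
    remaining = Counter(replica_racks)
    remaining[source_rack] -= 1
    # Racks that still hold a replica, plus the rack that receives the copy.
    racks_after = {rack for rack, n in remaining.items() if n > 0}
    racks_after.add(target_rack)
    return len(racks_after) >= len(racks_before)
```

For example, with replicas on racks r1, r1 and r2, moving the r2 replica onto r1 would be rejected, while moving one of the r1 replicas to a third rack would be allowed.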

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Status: Patch Available  (was: Open)

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, balancer3.patch, balancer4.patch, balancer5.patch, balancer6.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, BalancerUserGuide2.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Commented: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515423 ] 

Hairong Kuang commented on HADOOP-1652:
---------------------------------------

Here are some of my initial thoughts. Please comment.

1. What's balance?
A cluster is balanced iff there are no under-capacity or over-capacity data nodes in the cluster.
An under-capacity data node is a node whose %used space is less than avg_%used_space - threshold.
An over-capacity data node is a node whose %used space is greater than avg_%used_space + threshold.
The threshold is user-configurable. A default value could be 20% of %used space.
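
As an illustration of the definition above, here is a minimal sketch. It assumes that avg_%used_space is the simple mean of the per-node percentages and that the threshold is expressed in absolute percentage points; the comment does not settle either point, and the function names are hypothetical.

```python
def classify_nodes(used_pct, threshold=20.0):
    """Split data nodes into under-capacity and over-capacity sets.

    used_pct: dict mapping node name -> %used space (0..100).
    threshold: allowed deviation from the average, in percentage points
               (assumed interpretation of the proposed default of 20%).
    """
    avg = sum(used_pct.values()) / len(used_pct)
    under = sorted(n for n, p in used_pct.items() if p < avg - threshold)
    over = sorted(n for n, p in used_pct.items() if p > avg + threshold)
    return avg, under, over


def is_balanced(used_pct, threshold=20.0):
    """A cluster is balanced iff no node is under- or over-capacity."""
    _, under, over = classify_nodes(used_pct, threshold)
    return not under and not over
```

With nodes at 90%, 50% and 10% used, the average is 50%, so the 90% node is over-capacity and the 10% node is under-capacity.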

2. When to rebalance?
Rebalancing is performed on demand. An administrator issues a command to trigger rebalancing. Rebalancing shuts off automatically once the cluster is balanced, and it can also be interrupted by an administrator. The following commands are to be supported:
Hadoop dfsadmin balance <start/stop/get>
                  -----Start/stop data block rebalancing or query its status. 

3. How to balance?
    (a) Upon receiving a data block rebalancing request, a name node creates a Balancing thread. 
    (b) The thread performs rebalancing iteratively. 
          (1) At each iteration, it scans the whole data node list and schedules block moving tasks. It sleeps for a heartbeat interval between iterations.
          (2) When scanning the data node list, if it finds an under-capacity data node, it schedules moving blocks to that data node. The source data node is chosen randomly from the over-capacity data nodes or, if none exists, from the non-under-capacity data nodes. The source block is chosen randomly from the source data node, as long as moving it does not violate requirement (1).
          (3) If the thread finds an over-capacity data node, it schedules moving blocks from that data node to other data nodes. It chooses a target data node randomly from the under-capacity data nodes or, if none exists, from the non-over-capacity data nodes. It then randomly chooses a source block whose move does not violate requirement (1).
          (4) The scheduled tasks are put into a queue in the source data node. The task queue has a limited length of 4 by default and is configurable.
          (5) The scheduled tasks are sent to data nodes for execution in response to a heartbeat message. Currently dfs limits this to at most 2 tasks per heartbeat by default.
    (c) The thread stops and frees itself when the cluster becomes balanced.
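
Steps (2) and (3) of the scan can be sketched as follows. This is only an illustration of the node-pairing rule under stated assumptions: the average is the simple mean of per-node %used, the threshold is in percentage points, and block selection, the per-node task queue, and the heartbeat sleep are left out. The function name and the (source, target) tuple format are hypothetical.

```python
import random


def schedule_iteration(used_pct, threshold=20.0, rng=random):
    """Return a list of (source, target) node pairs for one scan.

    used_pct: dict mapping node name -> %used space.
    rng: object with a choice() method, injectable for testing.
    """
    avg = sum(used_pct.values()) / len(used_pct)
    under = [n for n, p in used_pct.items() if p < avg - threshold]
    over = [n for n, p in used_pct.items() if p > avg + threshold]
    moves = []
    # Step (2): fill each under-capacity node from an over-capacity node,
    # or from any non-under-capacity node if no over-capacity node exists.
    for target in under:
        candidates = over or [n for n in used_pct if n not in under]
        if candidates:
            moves.append((rng.choice(candidates), target))
    # Step (3): drain each over-capacity node into an under-capacity node,
    # or into any non-over-capacity node if no under-capacity node exists.
    for source in over:
        candidates = under or [n for n in used_pct if n not in over]
        if candidates:
            moves.append((source, rng.choice(candidates)))
    return moves
```

A balanced cluster yields an empty list, which corresponds to the stop condition in (c).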

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Attachment: balancer5.patch

The patch incorporates the latest review comments from Sanjay.

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, balancer3.patch, balancer4.patch, balancer5.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Issue Comment Edited: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515423 ] 

Hairong Kuang edited comment on HADOOP-1652 at 7/25/07 2:28 PM:
----------------------------------------------------------------

Here are some of my initial thoughts. Please comment.

1. What's balance?
A cluster is balanced iff there are no under-capacity or over-capacity data nodes in the cluster.
An under-capacity data node is a node whose %used space is less than avg_%used_space - threshold.
An over-capacity data node is a node whose %used space is greater than avg_%used_space + threshold.
The threshold is user-configurable. A default value could be 20% of %used space.

2. When to rebalance?
Rebalancing is performed on demand. An administrator issues a command to trigger rebalancing. Rebalancing shuts off automatically once the cluster is balanced, and it can also be interrupted by an administrator. The following commands are to be supported:
Hadoop dfsadmin balance <start/stop/get>
                  -----Start/stop data block rebalancing or query its status. 

3. How to balance?
    * Upon receiving a data block rebalancing request, a name node creates a Balancing thread. 
    * The thread performs rebalancing iteratively. 
          # At each iteration, it scans the whole data node list and schedules block moving tasks. It sleeps for a heartbeat interval between iterations.
          # When scanning the data node list, if it finds an under-capacity data node, it schedules moving blocks to that data node. The source data node is chosen randomly from the over-capacity data nodes or, if none exists, from the non-under-capacity data nodes. The source block is chosen randomly from the source data node, as long as moving it does not violate requirement (1).
          # If the thread finds an over-capacity data node, it schedules moving blocks from that data node to other data nodes. It chooses a target data node randomly from the under-capacity data nodes or, if none exists, from the non-over-capacity data nodes. It then randomly chooses a source block whose move does not violate requirement (1).
          # The scheduled tasks are put into a queue in the source data node. The task queue has a limited length of 4 by default and is configurable.
          # The scheduled tasks are sent to data nodes for execution in response to a heartbeat message. Currently dfs limits this to at most 2 tasks per heartbeat by default.
    * The thread stops and frees itself when the cluster becomes balanced.
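
The throttling described in the last two numbered items (a bounded task queue per source data node, drained a few tasks at a time on each heartbeat) could look roughly like the sketch below. The class and method names are hypothetical; only the default limits of 4 queued tasks and 2 tasks per heartbeat come from the comment.

```python
from collections import deque


class SourceTaskQueue:
    """Bounded queue of block-move tasks for one source data node."""

    def __init__(self, max_len=4, per_heartbeat=2):
        self.max_len = max_len              # queue length limit (default 4)
        self.per_heartbeat = per_heartbeat  # tasks sent per heartbeat (default 2)
        self._tasks = deque()

    def offer(self, task):
        """Queue a task; return False if the queue is already full."""
        if len(self._tasks) >= self.max_len:
            return False
        self._tasks.append(task)
        return True

    def on_heartbeat(self):
        """Return the tasks to send in reply to this heartbeat, oldest first."""
        n = min(self.per_heartbeat, len(self._tasks))
        return [self._tasks.popleft() for _ in range(n)]
```

Rejecting tasks once the queue is full is what bounds the scheduling work per node: the balancing thread simply retries on a later iteration.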


[jira] Commented: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515441 ] 

Hairong Kuang commented on HADOOP-1652:
---------------------------------------

Some more thoughts for discussion...

1. Put the cluster in safe mode while rebalancing. This allows us to schedule block moving tasks more aggressively, but it interrupts the normal operation of the cluster.
2. Spawn a separate process on the client side to do all the scheduling work. The name node ships a snapshot of all data node descriptors and all blocks to the process at the beginning. At the end, the process sends all the scheduled tasks back to the namenode. This approach does not interrupt namenode work, but it requires shipping a large amount of data from the namenode at the beginning and back to the namenode at the end.

[jira] Commented: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515836 ] 

Hairong Kuang commented on HADOOP-1652:
---------------------------------------

When there is no over-capacity data node in the cluster, a source data node should be chosen from data nodes whose %used space is above the average %used space, among which we should favor data nodes that are on the same rack as the target data node. I am not sure whether we should favor fuller data nodes: doing so is more expensive than a random selection, but it makes the algorithm converge faster. This raises another question: how can we guarantee that the algorithm converges if rebalancing is not done in safe mode?
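
The source selection rule proposed here can be sketched as below. It is an illustration only: the node table format (name mapped to a %used and rack pair) and the function name are assumptions, and the open question of favoring fuller nodes is left out by choosing randomly within the preferred group.

```python
import random


def choose_source(nodes, target, rng=random):
    """Pick a source node for moving a block to `target`.

    nodes: dict mapping node name -> (used_pct, rack) (hypothetical model).
    Candidates are nodes above the average %used; those on the target's
    rack are preferred. Returns None if no node is above average.
    """
    avg = sum(pct for pct, _ in nodes.values()) / len(nodes)
    above = [n for n, (pct, _) in nodes.items() if pct > avg and n != target]
    if not above:
        return None
    target_rack = nodes[target][1]
    same_rack = [n for n in above if nodes[n][1] == target_rack]
    # Fall back to any above-average node if none shares the target's rack.
    return rng.choice(same_rack or above)
```

Preferring the same rack keeps the copy off the inter-rack links, at the cost of doing nothing to spread replicas across racks.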

[jira] Commented: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515518 ] 

Enis Soztutar commented on HADOOP-1652:
---------------------------------------

A balanced cluster, in terms of disk space usage, should be one in which the percentage of used disk space is balanced. Please take a look at HADOOP-1530. 

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Attachment: balancer6.patch

The patch fixed the findbugs errors.

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, balancer3.patch, balancer4.patch, balancer5.patch, balancer6.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, BalancerUserGuide2.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Attachment: RebalanceDesign5.pdf

Attached is a new version of the design document which includes the following changes:
1. A block is pulled by a destination instead of pushed from a source.
2. A block can be pulled from a proxy source which contains the block and is closer to the destination or less loaded than the source.
3. A source does not delete a block by itself. Instead, a destination notifies the name node that a block has been copied and passes a hint that the replica at the source should be removed if the block is over-replicated. When the name node decides to remove an excess replica, it favors the hint as long as doing so does not violate rebalancing requirement 1. A source removes the block upon a request from the name node.
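
Change 3 can be illustrated with a small sketch of how the name node might pick the excess replica. This is not taken from the design document: the function name, the replica table (node name mapped to rack), the `min_racks` parameter (the number of racks the block occupied before the move), and the fallback of trying the remaining nodes in sorted order are all assumptions.

```python
def choose_replica_to_delete(replicas, hint, min_racks):
    """Choose which replica of an over-replicated block to remove.

    replicas: dict mapping node name -> rack for every current replica.
    hint: node suggested for deletion by the destination (the source).
    min_racks: rack count that must be preserved (requirement 1).
    Returns a node name, or None if no deletion is safe.
    """
    def is_safe(node):
        racks_after = {rack for n, rack in replicas.items() if n != node}
        return len(racks_after) >= min_racks

    # Favor the hint when honoring it keeps the block on enough racks.
    if hint in replicas and is_safe(hint):
        return hint
    # Otherwise fall back to any other replica whose removal is safe.
    for node in sorted(replicas):
        if is_safe(node):
            return node
    return None
```

When the hint would drop a rack, the fallback may remove the freshly copied replica instead, which leaves availability intact but undoes that move.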

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.15.0
>
>         Attachments: RebalanceDesign4.pdf, RebalanceDesign5.pdf

[jira] Commented: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515859 ] 

Stu Hood commented on HADOOP-1652:
----------------------------------

{quote}
...among which we should favor data nodes that are on the same rack as the target data node.
{quote}

This is a tradeoff though... one of the reasons for recording what racks blocks are on is to prevent putting "all of your eggs in one basket" so to speak (See the middle paragraph of http://lucene.apache.org/hadoop/hdfs_design.html#Data+Replication ).

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Attachment: balancer2.patch

This patch incorporates the second round of review comments from Sanjay. In particular, it adds the following features:
1. Disallow more than one balancer running in an HDFS cluster;
2. Each balancing iteration runs no more than 20 minutes;
3. Disallow moving a block more than once during the whole balancing process;
4. Each block move has a timeout;
5. Each line of output has a timestamp.
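
As a rough illustration of features 2 and 3 only, the bookkeeping could look like the sketch below. This is not code from balancer2.patch: the class name MovedBlocksSketch, its methods, and the use of long block IDs are all hypothetical, and only the 20-minute limit comes from the list above.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch, not the patch's code: move-once tracking and an iteration time cap. */
final class MovedBlocksSketch {
    /** 20 minutes, as stated in the patch notes. */
    static final long MAX_ITERATION_MILLIS = 20L * 60L * 1000L;

    /** IDs of blocks that have already been moved during this balancing run. */
    private final Set<Long> movedBlockIds = new HashSet<>();

    /** Records the block and returns true only the first time a given ID is offered. */
    boolean tryMarkMoved(long blockId) {
        return movedBlockIds.add(blockId);
    }

    /** Returns true once an iteration has used up its 20-minute budget. */
    static boolean iterationExpired(long startMillis, long nowMillis) {
        return nowMillis - startMillis >= MAX_ITERATION_MILLIS;
    }

    public static void main(String[] args) {
        MovedBlocksSketch tracker = new MovedBlocksSketch();
        System.out.println("first move allowed:  " + tracker.tryMarkMoved(42L));
        System.out.println("second move allowed: " + tracker.tryMarkMoved(42L));
    }
}
```

A balancer loop would call tryMarkMoved before scheduling a move and skip the block when it returns false.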

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, BalancerAdminGuide.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Commented: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548475 ] 

Hadoop QA commented on HADOOP-1652:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12370975/balancer6.patch
against trunk revision r601111.

    @author +1.  The patch does not contain any @author tags.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new compiler warnings.

    findbugs -1.  The patch appears to introduce 2 new Findbugs warnings.

    core tests +1.  The patch passed core unit tests.

    contrib tests +1.  The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1262/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1262/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1262/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1262/console

This message is automatically generated.

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, balancer3.patch, balancer4.patch, balancer5.patch, balancer6.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, BalancerUserGuide2.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Attachment: balancer8.patch

The patch has a minor change to make the junit test run faster.

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, balancer3.patch, balancer4.patch, balancer5.patch, balancer6.patch, balancer7.patch, balancer8.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, BalancerUserGuide2.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Commented: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548389 ] 

Hadoop QA commented on HADOOP-1652:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12370962/balancer5.patch
against trunk revision r601038.

    @author +1.  The patch does not contain any @author tags.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new compiler warnings.

    findbugs -1.  The patch appears to introduce 9 new Findbugs warnings.

    core tests +1.  The patch passed core unit tests.

    contrib tests -1.  The patch failed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1259/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1259/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1259/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1259/console

This message is automatically generated.

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, balancer3.patch, balancer4.patch, balancer5.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, BalancerUserGuide2.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Commented: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Koji Noguchi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538631 ] 

Koji Noguchi commented on HADOOP-1652:
--------------------------------------

Should we perform rebalancing if the cluster upgrade has not yet been finalized with 'finalizeUpgrade'?

If yes, maybe add one test case for this?

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Attachment: BalancerAdminGuide.pdf

The balancer administrator guide is attached.

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, BalancerAdminGuide.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Issue Comment Edited: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537167 ] 

hairong edited comment on HADOOP-1652 at 10/23/07 3:48 PM:
-----------------------------------------------------------------

Upload the most updated design document.

      was (Author: hairong):
    Upload the most updated design documents.
  
> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Commented: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515950 ] 

Arun C Murthy commented on HADOOP-1652:
---------------------------------------

I'd say, semantically, that a stronger definition of a _balanced cluster_ is one in which the percentage of *free disk-space* (across all available partitions, per-node) is balanced. 

I know, it's a bit of a word-play, but yet ... *smile*

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.15.0

[jira] Commented: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515909 ] 

Hairong Kuang commented on HADOOP-1652:
---------------------------------------

The reason I'd like to favor data nodes that are on the same rack as the destination node is that dfs does not need to check whether "all of your eggs are in one basket" :-) when moving a block from such a node to the destination. When I say "move", I mean replicating to the destination and removing from the source.
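
To make the rack argument concrete, here is a minimal sketch of the check that a cross-rack move would need and a same-rack move can skip. It is an illustration under assumptions, not dfs code: the class RackCheckSketch, the method movePreservesRackCount, and the representation of racks as plain strings are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

/** Hypothetical sketch: does moving one replica keep the block on as many racks as before? */
final class RackCheckSketch {
    /**
     * replicaRacks lists the rack of every current replica, the source replica included.
     * A move is modelled as "replicate to the destination, then remove from the source".
     */
    static boolean movePreservesRackCount(List<String> replicaRacks,
                                          String sourceRack,
                                          String destRack) {
        if (!replicaRacks.contains(sourceRack)) {
            throw new IllegalArgumentException("no replica on source rack " + sourceRack);
        }
        int racksBefore = new HashSet<>(replicaRacks).size();
        List<String> after = new ArrayList<>(replicaRacks);
        after.add(destRack);       // replicate to the destination
        after.remove(sourceRack);  // remove one replica from the source rack
        int racksAfter = new HashSet<>(after).size();
        return racksAfter >= racksBefore;
    }

    public static void main(String[] args) {
        List<String> racks = List.of("rackA", "rackB", "rackB");
        System.out.println("same rack:  " + movePreservesRackCount(racks, "rackB", "rackB"));
        System.out.println("cross rack: " + movePreservesRackCount(racks, "rackA", "rackB"));
    }
}
```

When the source and destination share a rack, the set of racks holding the block is unchanged, so the check always passes. That is why such sources are cheaper to choose.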

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.15.0

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Attachment:     (was: balancer1.patch)

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, BalancerAdminGuide.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Attachment: balancer4.patch

This is a new patch built from the most recent trunk. It removes the part of the code that was committed in HADOOP-2256.

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, balancer3.patch, balancer4.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Issue Comment Edited: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515423 ] 

Hairong Kuang edited comment on HADOOP-1652 at 7/25/07 2:30 PM:
----------------------------------------------------------------

Here are some of my initial thoughts. Please comment.

1. What's balance?
A cluster is balanced iff there are no under-capacity or over-capacity data nodes in the cluster.
An under-capacity data node is a node whose %used space is less than avg_%used_space - threshold.
An over-capacity data node is a node whose %used space is greater than avg_%used_space + threshold.
The threshold is user configurable. A default value could be 20% of %used space.

2. When to rebalance?
Rebalancing is performed on demand. An administrator issues a command to trigger rebalancing. Rebalancing automatically shuts off once the cluster is balanced and can also be interrupted by an administrator. The following commands are to be supported:
Hadoop dfsadmin balance <start/stop/get>
                  -----Start/stop data block rebalancing or query its status. 

3. How to balance?
    (a) Upon receiving a data block rebalancing request, the name node creates a Balancing thread. 
    (b) The thread performs rebalancing iteratively. 
          # At each iteration, it scans the whole data node list and schedules block moving tasks. It sleeps for a heartbeat interval between iterations;
          # When scanning the data node list, if it finds an under-capacity data node, it schedules moving blocks to that data node. The source data node is chosen randomly from the over-capacity data nodes, or from the non-under-capacity data nodes if no over-capacity data node exists. The source block is chosen randomly from the source data node as long as moving it does not violate requirement (1).
          # If the thread finds an over-capacity data node, it schedules moving blocks from that data node to other data nodes. It chooses a target data node randomly from the under-capacity data nodes, or from the non-over-capacity data nodes when there is no under-capacity data node. It then randomly chooses a source block whose move does not violate requirement (1). 
          # The scheduled tasks are put into a queue in the source data node. The task queue has a limited length of 4 by default and is configurable.
          # The scheduled tasks are sent to data nodes for execution in response to a heartbeat message. Currently dfs limits this to at most 2 tasks per heartbeat by default.
    (c) The thread stops and frees itself when the cluster becomes balanced.
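
The definition in section 1 can be sketched as below. This is only an illustration of the proposal, with two assumptions that the comment does not settle: the threshold is treated as absolute percentage points, and avg_%used_space is taken as the simple mean of the per-node percentages. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the proposed balance definition; not code from any patch. */
final class BalanceSketch {
    enum State { UNDER_CAPACITY, BALANCED, OVER_CAPACITY }

    /** Assumption: avg_%used_space is the plain mean of per-node %used values (0..100). */
    static double averageUsedPercent(Map<String, Double> usedPercentByNode) {
        if (usedPercentByNode.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (double used : usedPercentByNode.values()) {
            sum += used;
        }
        return sum / usedPercentByNode.size();
    }

    /** Assumption: threshold is in percentage points, e.g. 20.0. */
    static State classify(double usedPercent, double avgUsedPercent, double threshold) {
        if (usedPercent < avgUsedPercent - threshold) {
            return State.UNDER_CAPACITY;
        }
        if (usedPercent > avgUsedPercent + threshold) {
            return State.OVER_CAPACITY;
        }
        return State.BALANCED;
    }

    /** The cluster is balanced iff no node is under-capacity or over-capacity. */
    static boolean isBalanced(Map<String, Double> usedPercentByNode, double threshold) {
        double avg = averageUsedPercent(usedPercentByNode);
        for (double used : usedPercentByNode.values()) {
            if (classify(used, avg, threshold) != State.BALANCED) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Double> nodes = new LinkedHashMap<>();
        nodes.put("dn1", 90.0);
        nodes.put("dn2", 60.0);
        nodes.put("dn3", 0.0); // a newly added, empty data node
        double avg = averageUsedPercent(nodes);
        for (Map.Entry<String, Double> e : nodes.entrySet()) {
            System.out.println(e.getKey() + " -> " + classify(e.getValue(), avg, 20.0));
        }
        System.out.println("balanced: " + isBalanced(nodes, 20.0));
    }
}
```

The balancing thread in section 3 would stop once isBalanced returns true, which corresponds to step (c).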



> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.15.0

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Attachment: BalancerUserGuide2.pdf

This is an updated user guide.

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, balancer3.patch, balancer4.patch, balancer5.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, BalancerUserGuide2.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-1652:
-------------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this. Thanks Hairong!

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, balancer3.patch, balancer4.patch, balancer5.patch, balancer6.patch, balancer7.patch, balancer8.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, BalancerUserGuide2.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Attachment: RebalanceDesign4.pdf

This is a more detailed design document for the rebalancing tool. A major change is that the rebalancing decisions are made in a separate process from the name node. Data nodes are throttled to prevent rebalancing from using too much network bandwidth. I also added the protocol design, a discussion of race conditions, and a test plan.

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.15.0
>
>         Attachments: RebalanceDesign4.pdf

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-1652:
----------------------------------

    Fix Version/s:     (was: 0.15.0)
                   0.16.0

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: RebalanceDesign4.pdf, RebalanceDesign5.pdf

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Status: Patch Available  (was: Open)

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, balancer3.patch, balancer4.patch, balancer5.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Status: Open  (was: Patch Available)

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, balancer3.patch, balancer4.patch, balancer5.patch, balancer6.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, BalancerUserGuide2.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Commented: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515802 ] 

Hairong Kuang commented on HADOOP-1652:
---------------------------------------

Yes, I agree with you. My definition of a balanced cluster was in terms of %used space.
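
To illustrate what a definition in terms of %used space might look like, here is a minimal sketch. It assumes that "balanced" means every data node's percent-used value lies within some threshold of the cluster-wide percent used; that rule, the threshold parameter, and all names below are assumptions for illustration and are not stated in this thread.

```java
/** Illustrative sketch only; the balance rule and names are assumptions. */
class UtilizationSketch {

    /** Percent of capacity in use, from 0.0 to 100.0. */
    static double percentUsed(long usedBytes, long capacityBytes) {
        return 100.0 * usedBytes / capacityBytes;
    }

    /**
     * Returns true if every node's %used is within thresholdPercent of the
     * cluster-wide %used. used[i] and capacity[i] describe node i.
     */
    static boolean isBalanced(long[] used, long[] capacity, double thresholdPercent) {
        long totalUsed = 0;
        long totalCapacity = 0;
        for (int i = 0; i < used.length; i++) {
            totalUsed += used[i];
            totalCapacity += capacity[i];
        }
        double clusterPercent = percentUsed(totalUsed, totalCapacity);
        for (int i = 0; i < used.length; i++) {
            double nodePercent = percentUsed(used[i], capacity[i]);
            if (Math.abs(nodePercent - clusterPercent) > thresholdPercent) {
                return false;
            }
        }
        return true;
    }
}
```

Under this measure, two nodes holding 50 of 100 bytes and 100 of 200 bytes count as balanced even though they store different amounts of data, because both are 50% used.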

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.15.0

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Attachment: BalancerAdminGuide1.pdf

Here is a new administrator guide that reflects the change to the balancer.

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Commented: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538648 ] 

Hairong Kuang commented on HADOOP-1652:
---------------------------------------

Currently the balancer does not check whether the cluster has been "finalizeUpgrade"d or not. I can run a test to see what happens if rebalancing is performed when a cluster has been upgraded but not yet finalized.

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Status: Open  (was: Patch Available)

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, balancer3.patch, balancer4.patch, balancer5.patch, balancer6.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, BalancerUserGuide2.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Commented: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515472 ] 

Hairong Kuang commented on HADOOP-1652:
---------------------------------------

A destination data node should also have an upper limit on the number of blocks scheduled for it to receive. Currently in dfs, when blocks are re-replicated, they are pushed from source to destination. Dfs enforces a maximum number of concurrent block transfers for the source, but it has no limit on the destination side, so a destination data node may end up receiving many blocks concurrently. Shall we also limit the number of concurrent writes on a data node?
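
One way such a destination-side limit could work is a counting semaphore that a data node consults before accepting an incoming block. The sketch below is a hypothetical illustration of that idea, not the dfs implementation; the class name, method names, and the limit value are all assumptions.

```java
import java.util.concurrent.Semaphore;

/** Illustrative sketch only; not part of dfs. */
class ReceiveLimiterSketch {
    // One permit per block transfer the node is willing to receive at once.
    private final Semaphore permits;

    ReceiveLimiterSketch(int maxConcurrentReceives) {
        this.permits = new Semaphore(maxConcurrentReceives);
    }

    /** Returns true if the transfer may start, false if the node is at its limit. */
    boolean tryStartReceive() {
        return permits.tryAcquire();
    }

    /** Must be called once for every transfer that was allowed to start. */
    void finishReceive() {
        permits.release();
    }
}
```

A rejected transfer would simply be retried later or redirected to another target, so the sender never blocks waiting on a busy destination.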

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.15.0

[jira] Commented: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548523 ] 

Hadoop QA commented on HADOOP-1652:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12370987/balancer7.patch
against trunk revision r601111.

    @author +1.  The patch does not contain any @author tags.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new compiler warnings.

    findbugs +1.  The patch does not introduce any new Findbugs warnings.

    core tests +1.  The patch passed core unit tests.

    contrib tests -1.  The patch failed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1264/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1264/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1264/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1264/console

This message is automatically generated.

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, balancer3.patch, balancer4.patch, balancer5.patch, balancer6.patch, balancer7.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, BalancerUserGuide2.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Attachment: balancer1.patch

Here is a new patch that incorporates Sanjay's first pass of review comments.

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, BalancerAdminGuide.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf

[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Attachment: balancer7.patch

findbugs fix.

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, balancer3.patch, balancer4.patch, balancer5.patch, balancer6.patch, balancer7.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, BalancerUserGuide2.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf


[jira] Commented: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549008 ] 

Hudson commented on HADOOP-1652:
--------------------------------

Integrated in Hadoop-Nightly #324 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/324/])

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, balancer3.patch, balancer4.patch, balancer5.patch, balancer6.patch, balancer7.patch, balancer8.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, BalancerUserGuide2.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf


[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Attachment: balancer1.patch

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, BalancerAdminGuide.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf


[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Attachment: RebalanceDesign6.pdf

Uploaded the latest design document.

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf


[jira] Commented: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Sanjay Radia (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548421 ] 

Sanjay Radia commented on HADOOP-1652:
--------------------------------------

Code looks good
+1

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, balancer3.patch, balancer4.patch, balancer5.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, BalancerUserGuide2.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf


[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Status: Patch Available  (was: Open)

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, balancer3.patch, balancer4.patch, balancer5.patch, balancer6.patch, balancer7.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, BalancerUserGuide2.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf


[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Attachment: balancer3.patch

I added one more change to the balancer: it shuffles the datanode array before constructing overUtilizedDatanodeList, underUtilizedDatanodeList, etc. This adds some randomness to the otherwise static source/target datanode matching.
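
The shuffle described above can be sketched roughly as follows. This is only an illustration and not the code in balancer3.patch: the Node and Classified types, the classify method, and the rule that a node counts as over- or under-utilized when its utilization differs from the plain mean by more than a threshold are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class BalancerShuffleSketch {
    /** Hypothetical stand-in for a datanode report. */
    static final class Node {
        final String name;
        final double utilization; // used space as a percentage of capacity

        Node(String name, double utilization) {
            this.name = name;
            this.utilization = utilization;
        }
    }

    /** Holds the two lists named in the comment; the real balancer may keep more. */
    static final class Classified {
        final List<Node> overUtilizedDatanodeList = new ArrayList<>();
        final List<Node> underUtilizedDatanodeList = new ArrayList<>();
    }

    /**
     * Shuffles a copy of the datanode array, then classifies each node.
     * Assumption: "over" means more than `threshold` points above the mean
     * utilization, "under" means more than `threshold` points below it.
     */
    static Classified classify(Node[] datanodes, double threshold, Random random) {
        // Shuffle a copy so the caller's array keeps its original order.
        List<Node> shuffled = new ArrayList<>(Arrays.asList(datanodes));
        Collections.shuffle(shuffled, random);

        // Simplification: plain mean of per-node utilization.
        double sum = 0;
        for (Node n : shuffled) {
            sum += n.utilization;
        }
        double average = shuffled.isEmpty() ? 0 : sum / shuffled.size();

        // List order now follows the shuffled order, so a later pairing of
        // sources with targets by position differs from run to run.
        Classified result = new Classified();
        for (Node n : shuffled) {
            if (n.utilization > average + threshold) {
                result.overUtilizedDatanodeList.add(n);
            } else if (n.utilization < average - threshold) {
                result.underUtilizedDatanodeList.add(n);
            }
        }
        return result;
    }
}
```

Which nodes land in each list does not depend on the shuffle; only their order within the lists changes, which is what randomizes any position-based source/target matching done afterwards.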

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, balancer1.patch, balancer2.patch, balancer3.patch, BalancerAdminGuide.pdf, BalancerAdminGuide1.pdf, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf


[jira] Commented: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516117 ] 

Hairong Kuang commented on HADOOP-1652:
---------------------------------------

Arun, yes, I agree. This can be done by instructing data nodes to balance their partitions. I'd like to do it in a different jira.

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.15.0


[jira] Updated: (HADOOP-1652) Rebalance data blocks when new data nodes added or data nodes become full

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1652:
----------------------------------

    Attachment: balancer.patch

A patch for review.

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: balancer.patch, RebalanceDesign4.pdf, RebalanceDesign5.pdf, RebalanceDesign6.pdf
