Posted to mapreduce-user@hadoop.apache.org by feedly team <fe...@gmail.com> on 2014/03/26 15:51:41 UTC

re-replication after data node failure

We recently had a node die in our HBase cluster. Afterwards, we saw a huge
increase in traffic and I/O as HDFS re-replicated the data from the dead
node. This negatively affected our application, and we are trying to see if
there is a way to slow the process down so the app can still run (if a bit
slower).

Is the balancer job responsible for re-replication? That was our first
thought, but the docs mostly mention balancing disk utilization rather than
restoring the replication factor, so we aren't sure whether it's the
balancer or some other process.

If it is indeed the balancer, we saw there is a dfs.balance.bandwidthPerSec
setting that we could change. The default is 1 MB/s; does this mean that
each node sends and receives at most 1 MB/s during balancing? We saw much,
much higher sustained traffic than that. The levels we saw would be roughly
correct if this is the in + out limit per pair of data nodes, i.e. in a
5-node cluster, node1 would be limited to 1 MB/s to each of the other 4
nodes, so node1 would experience about 4 MB/s of traffic.
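
For reference, this is roughly how that property would be set in
hdfs-site.xml (a sketch only; the 4 MB/s value below is made up for
illustration, and newer Hadoop releases spell the same setting
dfs.datanode.balance.bandwidthPerSec):

  <property>
    <name>dfs.balance.bandwidthPerSec</name>
    <!-- bytes per second each DataNode may use for balancing traffic;
         the default is 1048576 (1 MB/s) -->
    <value>4194304</value>
  </property>

A change made this way would only take effect after the DataNodes are
restarted.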

Re: re-replication after data node failure

Posted by John Meagher <jo...@gmail.com>.
The balancer is not what handles adding extra replicas in the case of
a node failure, but it looks like the balancer bandwidth setting is
the way to throttle.  See:
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201301.mbox/%3C50F870C1.5010208@getjar.com%3E
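
If you only need a temporary change, there is also a runtime override that
avoids editing hdfs-site.xml and restarting DataNodes (a sketch assuming a
Hadoop 2.x-style CLI; the value is in bytes per second and is not persisted
across DataNode restarts):

  # raise the per-DataNode balancing bandwidth to 4 MB/s until the next restart
  hdfs dfsadmin -setBalancerBandwidth 4194304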

On Wed, Mar 26, 2014 at 10:51 AM, feedly team <fe...@gmail.com> wrote:
> We recently had a node die in our hbase cluster. Afterwards, we saw a huge
> increase in traffic and I/O as hdfs re-replicated data from the dead node.
> This negatively affected our application and we are trying to see if there
> is a way to slow down this process so the app can still run (if a bit
> slower).
>
> Is the balancer job responsible for re-replication? This was our first
> thought but the docs mostly mention balancing disk utilization rather than
> restoring the replication factor, so we aren't sure if it's responsible or
> if it's some other process.
>
> If it is indeed the balancer, we saw there is a dfs.balance.bandwidthPerSec
> setting that we could change. The default is 1MB, does this mean that each
> node sends and receives at most 1MB/sec during balancing? We saw much, much
> higher sustained traffic than this. The levels we saw would be roughly
> correct if this is the in + out limit per data node pair. I.e. if you have a
> 5 node cluster, node1 would be limited to 1MB to each of the other 4 nodes,
> meaning the node would experience 4MB/s of traffic.
