Posted to common-dev@hadoop.apache.org by "Hairong Kuang (JIRA)" <ji...@apache.org> on 2008/09/12 20:28:46 UTC

[jira] Commented: (HADOOP-4116) DataNode : idle rebalancing operations need not take up threads.

    [ https://issues.apache.org/jira/browse/HADOOP-4116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630647#action_12630647 ] 

Hairong Kuang commented on HADOOP-4116:
---------------------------------------

A closer investigation of the problem shows that the balancer needs additional improvements:

(1) The balancer needs to handle block move timeouts better. Currently it simply assumes that
a timed-out move has failed, but makes no effort to ensure that the move is interrupted and that the resources it
holds are released. The next phase of scheduling may then schedule more blocks to move from the same DataNode, thus using
more and more resources.
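
One way to picture the handling asked for in (1) is the following minimal sketch. It is not the Balancer or DataNode code: the class BlockMoveTimeoutSketch, the MOVE_PERMITS semaphore and the runMove() helper are hypothetical names, and a plain ExecutorService stands in for whatever actually executes a move. On timeout the move is interrupted, and the permit is released in a finally block however the move ends.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BlockMoveTimeoutSketch {
    // Hypothetical stand-in for the per-node limit on concurrent moves.
    static final Semaphore MOVE_PERMITS = new Semaphore(1);

    /** Runs one move; returns true only if it finished within timeoutMs. */
    static boolean runMove(ExecutorService pool, Callable<Void> move, long timeoutMs)
            throws InterruptedException {
        Future<Void> pending = pool.submit(() -> {
            MOVE_PERMITS.acquire();          // throws if interrupted while waiting
            try {
                return move.call();
            } finally {
                MOVE_PERMITS.release();      // runs on success, failure, or interrupt
            }
        });
        try {
            pending.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            pending.cancel(true);            // interrupt the move instead of abandoning it
            return false;
        } catch (ExecutionException e) {
            return false;                    // the move itself failed
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Simulated slow move; Thread.sleep responds to the interrupt from cancel(true).
        boolean completed = runMove(pool, () -> { Thread.sleep(10_000); return null; }, 100);
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("completed=" + completed
                + " permits=" + MOVE_PERMITS.availablePermits());
    }
}
```

The sketch only works if the move reacts to interruption. A move blocked in non-interruptible I/O would need its socket or stream closed as well.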

(2) Resource control for balancing at DataNodes should use a fair Semaphore. Currently
it uses an unfair Semaphore, which makes no guarantees about the order in which threads acquire permits: a
thread invoking acquire() can be allocated a permit ahead of a thread that has been waiting. Therefore, if a dfs
cluster has many DataNodes that have long queues of block move requests, it is very likely to enter the
following state: a thread in DataNode A holds a permit and asks DataNode B to receive a block, while a thread in
DataNode B holds a permit and asks DataNode A to receive a block. Although the block move from B to A was scheduled
much later than the move from A to B, they may be executed simultaneously. Both block receives are blocked on acquiring
a permit, assuming only one permit can be issued. Therefore, a deadlock occurs.
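
For reference, fairness in java.util.concurrent.Semaphore is chosen with the second constructor argument. The sketch below is illustrative only: FairSemaphoreSketch and acquisitionOrder() are hypothetical names, and the code does not reproduce the cross-DataNode scenario. It shows a fair, single-permit semaphore handing the permit to queued threads in arrival order.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Semaphore;

class FairSemaphoreSketch {
    /** Queues `waiters` threads one at a time and records who gets the permit when. */
    static List<Integer> acquisitionOrder(int waiters) throws InterruptedException {
        Semaphore permits = new Semaphore(1, true);   // true = fair (FIFO) ordering
        List<Integer> order = new CopyOnWriteArrayList<>();
        Thread[] threads = new Thread[waiters];

        permits.acquire();                            // hold the only permit for now
        for (int i = 0; i < waiters; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                permits.acquireUninterruptibly();
                try {
                    order.add(id);
                } finally {
                    permits.release();
                }
            });
            threads[i].start();
            // Wait until this thread is queued before starting the next one,
            // so the arrival order is known.
            while (permits.getQueueLength() <= i) {
                Thread.sleep(1);
            }
        }
        permits.release();                            // let the queue drain
        for (Thread t : threads) {
            t.join();
        }
        return order;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(acquisitionOrder(4));
    }
}
```

With new Semaphore(1), or equivalently new Semaphore(1, false), a thread that calls acquire() just as a permit is released may take it ahead of the queued threads. Note also that the untimed tryAcquire() barges even on a fair semaphore; tryAcquire(0, TimeUnit.SECONDS) honors the fairness setting.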

> DataNode : idle rebalancing operations need not take up threads.
> ----------------------------------------------------------------
>
>                 Key: HADOOP-4116
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4116
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.17.0
>            Reporter: Raghu Angadi
>
> The number of threads is currently limited on datanodes. Once these threads are occupied, the DataNode does not accept any more requests (DoS). Recently we saw a case where most of the 256 threads were waiting in {{DataXceiver.replaceBlock()}} trying to acquire {{balancingSem}}. Since rebalancing is (heavily) throttled, I would think this would be the common case.
> These operations, which are waiting for active rebalancing threads to finish, need not take up a thread.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.