
Posted to hdfs-dev@hadoop.apache.org by "Erik Krogen (Jira)" <ji...@apache.org> on 2019/11/08 18:06:00 UTC

[jira] [Created] (HDFS-14973) Balancer getBlocks RPC dispersal does not function properly

Erik Krogen created HDFS-14973:
----------------------------------

             Summary: Balancer getBlocks RPC dispersal does not function properly
                 Key: HDFS-14973
                 URL: https://issues.apache.org/jira/browse/HDFS-14973
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: balancer & mover
    Affects Versions: 3.0.0, 2.8.2, 2.7.4, 2.9.0
            Reporter: Erik Krogen
            Assignee: Erik Krogen


In HDFS-11384, a mechanism was added to make the {{getBlocks}} RPC calls issued by the Balancer/Mover more dispersed, to alleviate load on the NameNode, since {{getBlocks}} can be very expensive and the Balancer should not impact normal cluster operation.

Unfortunately, this mechanism does not work as expected, especially when the dispatcher thread count is low. The primary issue is that the delay is applied only to the first N threads submitted to the dispatcher's executor, where N is the size of the dispatcher's threadpool, but *not* to the first R threads, where R is the number of allowed {{getBlocks}} QPS (currently hardcoded to 20). For example, if the threadpool size is 100 (the default), threads 0-19 have no delay, threads 20-99 have increasing levels of delay, and threads 100+ have no delay. As I understand it, the intent of the logic was that the delay applied to the first 100 threads would force all of the dispatcher executor's threads to be consumed, thus blocking subsequent (non-delayed) threads until the delay period has expired. However, threads 0-19 can finish very quickly (their work can often be fulfilled in the time it takes to execute a single {{getBlocks}} RPC, on the order of tens of milliseconds), thus opening up 20 new slots in the executor, which are then consumed by non-delayed threads 100-119, and so on. So, although 80 threads have had a delay applied, the non-delayed threads rush through the 20 slots that never carry a delay.
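To make the schedule described above concrete, here is a minimal Java sketch that models it, assuming a threadpool size N and a {{getBlocks}} QPS R as defined in this issue. It is a simplified model of the described behavior, not the actual Dispatcher code; the names {{DelayScheduleSketch}} and {{delaySteps}} are hypothetical, and delays are expressed in abstract "delay steps" rather than any particular time unit.

```java
public class DelayScheduleSketch {
    // Hypothetical model of the schedule described in this issue (not the
    // real Dispatcher code): a task whose submission index is below poolSize
    // gets index / rpcPerSec delay steps; every task at or beyond poolSize
    // gets no delay at all.
    static int delaySteps(int index, int poolSize, int rpcPerSec) {
        if (index >= poolSize) {
            return 0;
        }
        return index / rpcPerSec;
    }

    public static void main(String[] args) {
        int poolSize = 100; // threadpool size used in the issue's example
        int rpcPerSec = 20; // hardcoded getBlocks QPS per the issue
        int[] probes = {0, 19, 20, 39, 80, 99, 100, 119};
        for (int i : probes) {
            System.out.println("task " + i + " -> "
                + delaySteps(i, poolSize, rpcPerSec) + " delay steps");
        }
    }
}
```

In this model, tasks 0 and 19 get 0 steps, tasks 20 through 99 get 1 to 4 steps, and tasks 100 and 119 drop back to 0, which is the gap that the non-delayed tasks slip through once tasks 0-19 complete.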

This problem gets even worse when the dispatcher threadpool size is less than the max {{getBlocks}} QPS. For example, if the threadpool size is 10, _no threads ever have a delay applied_, and the feature is effectively disabled.
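The small-pool case can be checked with the same simplified model (again hypothetical, not the Dispatcher source) by counting how many submitted tasks receive any delay at all. The names {{SmallPoolSketch}} and {{countDelayed}}, and the figure of 1000 submitted tasks, are illustrative assumptions.

```java
public class SmallPoolSketch {
    // Hypothetical model of the described schedule: tasks below poolSize get
    // index / rpcPerSec delay steps, all later tasks get none.
    static int delaySteps(int index, int poolSize, int rpcPerSec) {
        return index < poolSize ? index / rpcPerSec : 0;
    }

    // Counts how many of totalTasks submitted tasks get a nonzero delay.
    static int countDelayed(int totalTasks, int poolSize, int rpcPerSec) {
        int delayed = 0;
        for (int i = 0; i < totalTasks; i++) {
            if (delaySteps(i, poolSize, rpcPerSec) > 0) {
                delayed++;
            }
        }
        return delayed;
    }

    public static void main(String[] args) {
        int rpcPerSec = 20;    // hardcoded getBlocks QPS per the issue
        int totalTasks = 1000; // arbitrary number of tasks for illustration
        for (int poolSize : new int[] {10, 100}) {
            System.out.println("poolSize=" + poolSize + " delayed tasks: "
                + countDelayed(totalTasks, poolSize, rpcPerSec));
        }
    }
}
```

With R = 20, a pool of 10 yields 0 delayed tasks, because every index below the pool size divides by 20 to zero; a pool of 100 delays exactly 80 tasks no matter how many are submitted.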



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
