Posted to yarn-dev@hadoop.apache.org by "Andras Gyori (Jira)" <ji...@apache.org> on 2021/11/25 18:19:00 UTC

[jira] [Created] (YARN-11016) Queue weight is incorrectly reset to zero

Andras Gyori created YARN-11016:
-----------------------------------

             Summary: Queue weight is incorrectly reset to zero
                 Key: YARN-11016
                 URL: https://issues.apache.org/jira/browse/YARN-11016
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacity scheduler
            Reporter: Andras Gyori
            Assignee: Andras Gyori


QueueCapacities#clearConfigurableFields sets the WEIGHT capacity to 0, which can cause problems such as in the following scenario:
1. Initializing queues
2. The parent queue 'parent' has accessibleNodeLabels set. Since accessible node labels are inherited, its children (for example 'child') have the 'test' label as their accessible node label.
3. In LeafQueue#updateClusterResource, we call LeafQueue#activateApplications, which then calls LeafQueue#calculateAndGetAMResourceLimitPerPartition for each label (see getNodeLabelsForQueue). In this case, the labels are the accessible node labels (the inherited 'test'). During this call the ResourceUsage object is updated for the label 'test', thus extending its nodeLabelsSet with 'test'.
4. In a subsequent updateClusterResource call, for example on an addNode event, ResourceUsage now contains the 'test' label even though it was never explicitly configured. We then call CSQueueUtils#updateQueueStatistics, which takes the union of the node labels from QueueCapacities and ResourceUsage (this union is now the empty default label and 'test') and updates QueueCapacities for the label 'test'. Now QueueCapacities has 'test' in its nodeLabelsSet as well!
5. After a reinitialization (such as an update through the mutation API), CSQueueUtils#loadCapacitiesByLabelsFromCon is called, which resets the QueueCapacities values to zero (even the weight, which is wrong in my opinion) and loads the values again from the config. The problem here is that the values are reset for all node labels in QueueCapacities (even for 'test'), but we only load the values for the configured node labels (which we did not set, so they default to the empty label).
6. Now all children of 'parent' have weight=0 for 'test' in QueueCapacities, and that is why the update fails. It also explains why validation passes: the validation endpoint instantiates a brand new CapacityScheduler, in which this cascade of effects cannot accumulate (as there are no repeated updateClusterResource calls).
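
The sequence above can be reproduced with a reduced model. This is only a sketch under stated assumptions, not the Hadoop sources: LabelWeights is a hypothetical stand-in for the per-label weights in QueueCapacities, updateStatistics stands in for CSQueueUtils#updateQueueStatistics, and reinitialize stands in for the reset-then-reload done on reinitialization. The -1 "not set" sentinel and the weight of 1 for the default label are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Reduced model of the scenario; these are NOT the real Hadoop classes.
class QueueWeightResetSketch {
    static final String NO_LABEL = "";   // the empty default label
    static final float UNSET = -1f;      // assumed "weight not set" sentinel

    // Hypothetical stand-in for the per-label weights kept in QueueCapacities.
    static final class LabelWeights {
        private final Map<String, Float> weightByLabel = new HashMap<>();

        void setWeight(String label, float weight) {
            weightByLabel.put(label, weight);
        }

        float getWeight(String label) {
            return weightByLabel.getOrDefault(label, UNSET);
        }

        Set<String> labels() {
            return new TreeSet<>(weightByLabel.keySet());
        }

        // Models the reported behavior: every known label is reset to 0.
        void clearConfigurableFields() {
            weightByLabel.replaceAll((label, weight) -> 0f);
        }
    }

    // Step 4: the union of known labels and usage labels is written back,
    // so a label seen only in usage tracking gets an entry of its own.
    static void updateStatistics(LabelWeights caps, Set<String> usageLabels) {
        Set<String> union = caps.labels();
        union.addAll(usageLabels);
        for (String label : union) {
            caps.setWeight(label, caps.getWeight(label));
        }
    }

    // Step 5: reset ALL known labels, then reload only the configured ones.
    static void reinitialize(LabelWeights caps, Map<String, Float> configured) {
        caps.clearConfigurableFields();
        configured.forEach((label, weight) -> caps.setWeight(label, weight));
    }

    static LabelWeights simulate() {
        LabelWeights caps = new LabelWeights();
        // Step 1: only the default label is configured (weight 1 is assumed).
        Map<String, Float> configured = Map.of(NO_LABEL, 1f);
        reinitialize(caps, configured);
        // Steps 2-4: the inherited label 'test' leaks in through usage tracking.
        updateStatistics(caps, Set.of("test"));
        // Step 5: a later reinitialization zeroes 'test' and never reloads it.
        reinitialize(caps, configured);
        return caps;
    }

    public static void main(String[] args) {
        LabelWeights caps = simulate();
        for (String label : caps.labels()) {
            System.out.println("weight[\"" + label + "\"] = " + caps.getWeight(label));
        }
    }
}
```

In this model the default label keeps its configured weight, while 'test' ends up with an explicit weight of 0 instead of the "not set" value. A fresh LabelWeights instance (analogous to the brand new CapacityScheduler used by validation) never sees 'test' at all, which matches the observation in step 6.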



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
