Posted to yarn-issues@hadoop.apache.org by "Siddharth Ahuja (Jira)" <ji...@apache.org> on 2020/12/16 07:52:00 UTC

[jira] [Comment Edited] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues

    [ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250153#comment-17250153 ] 

Siddharth Ahuja edited comment on YARN-10528 at 12/16/20, 7:51 AM:
-------------------------------------------------------------------

I have made {{maxAMShare}} behave like the {{reservation}} element in the code: it is rejected on parent queues.

I performed the following testing on a single-node cluster:

Start with the following FairScheduler (FS) allocation XML:

{code}
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<allocations>
    <queue name="root">
        <weight>1.0</weight>
        <schedulingPolicy>drf</schedulingPolicy>
        <aclSubmitApps>*</aclSubmitApps>
        <aclAdministerApps>*</aclAdministerApps>
        <queue name="default">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
        </queue>
        <queue name="users" type="parent">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
            <maxAMShare>0.76</maxAMShare> <!-- root.users is a parent queue with maxAMShare set. This should not be possible. -->
        </queue>
        <queue name="blah">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
            <queue name="child">
                <weight>1.0</weight>
                <schedulingPolicy>drf</schedulingPolicy>
            </queue>
        </queue>
        <queue name="blah2" type="parent">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
            <queue name="child2">
                <weight>1.0</weight>
                <schedulingPolicy>drf</schedulingPolicy>
            </queue>
        </queue>
    </queue>
    <defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>
    <queueMaxAMShareDefault>0.75</queueMaxAMShareDefault>
    <queuePlacementPolicy>
        <rule name="specified" create="true"/>
        <rule name="nestedUserQueue" create="true">
            <rule name="default" create="true" queue="users"/>
        </rule>
    </queuePlacementPolicy>
</allocations>
{code}

Refresh YARN queues and observe the RM logs:

{code}
% bin/yarn rmadmin -refreshQueues
{code}

{code}
2020-12-16 18:12:29,665 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Failed to reload fair scheduler config file - will use existing allocations.
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.users are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element.
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.lambda$serviceInit$0(AllocationFileLoaderService.java:128)
at java.lang.Thread.run(Thread.java:748)


2020-12-16 18:15:04,056 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Failed to reload allocations file
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.users are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element.
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1571)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:438)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:409)
at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:120)
at org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:293)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:537)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966)
{code}

Now, update the FS XML so that {{maxAMShare}} is no longer set for root.users but is set for root.blah, a parent queue that is not explicitly tagged with type="parent":

{code}
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<allocations>
    <queue name="root">
        <weight>1.0</weight>
        <schedulingPolicy>drf</schedulingPolicy>
        <aclSubmitApps>*</aclSubmitApps>
        <aclAdministerApps>*</aclAdministerApps>
        <queue name="default">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
        </queue>
        <queue name="users" type="parent">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
        </queue>
        <queue name="blah">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
            <maxAMShare>0.76</maxAMShare> <!-- maxAMShare is set for root.blah, which is the parent of root.blah.child. This should not be possible either. -->
            <queue name="child">
                <weight>1.0</weight>
                <schedulingPolicy>drf</schedulingPolicy>
            </queue>
        </queue>
        <queue name="blah2" type="parent">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
            <queue name="child2">
                <weight>1.0</weight>
                <schedulingPolicy>drf</schedulingPolicy>
            </queue>
        </queue>
    </queue>
    <defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>
    <queueMaxAMShareDefault>0.75</queueMaxAMShareDefault>
    <queuePlacementPolicy>
        <rule name="specified" create="true"/>
        <rule name="nestedUserQueue" create="true">
            <rule name="default" create="true" queue="users"/>
        </rule>
    </queuePlacementPolicy>
</allocations>
{code}

{code}
% bin/yarn rmadmin -refreshQueues
{code}

{code}
2020-12-16 18:20:49,345 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Failed to reload allocations file
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.blah are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element.
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1571)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:438)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:409)
at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:120)
at org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:293)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:537)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966)

2020-12-16 18:20:49,937 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Failed to reload fair scheduler config file - will use existing allocations.
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.blah are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element.
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.lambda$serviceInit$0(AllocationFileLoaderService.java:128)
at java.lang.Thread.run(Thread.java:748)
{code}

Now, stop and restart the RM. The RM should fail to start:

{code}
2020-12-16 18:20:49,343 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file file:/Users/sidtheadmin/Cloudera/hadoop/hadoop-dist/target/hadoop-3.4.0-SNAPSHOT/etc/hadoop/fair-scheduler.xml
2020-12-16 18:20:49,345 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Failed to reload allocations file
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.blah are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element.
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1571)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:438)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:409)
at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:120)
at org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:293)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:537)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966)
2020-12-16 18:20:49,934 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file file:/Users/sidtheadmin/Cloudera/hadoop/hadoop-dist/target/hadoop-3.4.0-SNAPSHOT/etc/hadoop/fair-scheduler.xml
2020-12-16 18:20:49,937 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Failed to reload fair scheduler config file - will use existing allocations.
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.blah are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element.
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.lambda$serviceInit$0(AllocationFileLoaderService.java:128)
at java.lang.Thread.run(Thread.java:748)
{code}

From the above: if the RM is running and a bad config is applied through refreshQueues, the RM rejects the new (bad) config and continues to function with the old settings.

However, if the RM is restarted with a bad config, it fails fast. Again, this behaviour is the same as for the {{reservation}} element.
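
The two behaviours above (keep the old settings on a failed refresh, fail fast on a failed initial load) can be illustrated with a small sketch. This is not the RM code. It is written in Python for brevity, and {{AllocationConfigError}}, {{AllocationHolder}} and {{make_loader}} are hypothetical names used only for illustration:

```python
class AllocationConfigError(Exception):
    """Hypothetical stand-in for an invalid allocation file error."""


class AllocationHolder:
    """Hypothetical holder that keeps the last successfully loaded config."""

    def __init__(self, loader):
        # Initial load: an invalid config propagates, so startup fails fast.
        self._loader = loader
        self.current = loader()

    def refresh(self):
        """Try to reload; on failure keep the existing config."""
        try:
            self.current = self._loader()
            return True
        except AllocationConfigError:
            # The bad config is not accepted; the old settings stay in use.
            return False


def make_loader(state):
    """Build a loader whose validity can be toggled through `state`."""
    def load():
        if not state["valid"]:
            raise AllocationConfigError("parent queue sets maxAMShare")
        return dict(state["config"])
    return load


state = {"valid": True, "config": {"maxAMShare": 0.5}}
holder = AllocationHolder(make_loader(state))  # succeeds with a good config

state["valid"] = False      # simulate editing the file into a bad state
refreshed = holder.refresh()  # returns False, old settings are kept
```

Constructing a new {{AllocationHolder}} while the config is still invalid raises {{AllocationConfigError}}, which corresponds to the restart case.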

FWIW, I deleted an existing newline in the {{loadQueue()}} method. Although this is unrelated to the fix for this issue, it was needed to avoid the checkstyle error for method length exceeding 150 lines. It was not worth refactoring existing code to avoid this error, so the easiest way out was to delete the redundant newline.

I have also implemented JUnit tests and tested them thoroughly.
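
For anyone who wants to check an allocation file before applying it, the rule can be approximated outside the RM. The following is a minimal sketch in Python, not the Java from the patch or its JUnit tests. It assumes a queue counts as a parent when it has type="parent" or contains nested queue elements (the two conditions named in the error message), and the function name and the sample XML are hypothetical:

```python
import xml.etree.ElementTree as ET


def find_parent_queues_with_max_am_share(allocations_xml):
    """Return full names of parent queues that set maxAMShare.

    Assumption: a queue is a parent if it has type="parent" or contains
    nested <queue> elements.
    """
    offenders = []

    def visit(queue, prefix):
        name = queue.get("name", "")
        full_name = f"{prefix}.{name}" if prefix else name
        children = queue.findall("queue")  # direct child queues only
        is_parent = queue.get("type") == "parent" or bool(children)
        # find() also looks at direct children only, so a child queue's
        # maxAMShare is never attributed to its parent.
        if is_parent and queue.find("maxAMShare") is not None:
            offenders.append(full_name)
        for child in children:
            visit(child, full_name)

    root = ET.fromstring(allocations_xml)
    for top in root.findall("queue"):
        visit(top, "")
    return offenders


# Hypothetical sample covering a leaf queue, an explicit parent and an
# implicit parent.
SAMPLE = """
<allocations>
  <queue name="root">
    <queue name="default">
      <maxAMShare>0.5</maxAMShare>
    </queue>
    <queue name="users" type="parent">
      <maxAMShare>0.76</maxAMShare>
    </queue>
    <queue name="blah">
      <maxAMShare>0.76</maxAMShare>
      <queue name="child"/>
    </queue>
  </queue>
</allocations>
"""

offenders = find_parent_queues_with_max_am_share(SAMPLE)
```

With this sample, root.users and root.blah are reported, while the leaf queue root.default is allowed to set {{maxAMShare}}.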



> maxAMShare should only be accepted for leaf queues, not parent queues
> ---------------------------------------------------------------------
>
>                 Key: YARN-10528
>                 URL: https://issues.apache.org/jira/browse/YARN-10528
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Siddharth Ahuja
>            Assignee: Siddharth Ahuja
>            Priority: Major
>         Attachments: YARN-10528.001.patch, maxAMShare for root.users (parent queue) has no effect as child queue does not inherit it.png
>
>
> Based on the [Hadoop documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html], it is clear that the {{maxAMShare}} property can only be used for *leaf queues*. This is similar to the {{reservation}} setting.
> However, existing code only ensures that the reservation setting is not accepted for "parent" queues (see https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226 and https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L233) but it is missing the checks for {{maxAMShare}}. Due to this, it is currently possible to have an allocation similar to below:
> {code}
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <allocations>
>     <queue name="root">
>         <weight>1.0</weight>
>         <schedulingPolicy>drf</schedulingPolicy>
>         <aclSubmitApps>*</aclSubmitApps>
>         <aclAdministerApps>*</aclAdministerApps>
>         <queue name="default">
>             <weight>1.0</weight>
>             <schedulingPolicy>drf</schedulingPolicy>
>         </queue>
>         <queue name="users" type="parent">
>             <weight>1.0</weight>
>             <schedulingPolicy>drf</schedulingPolicy>
>             <maxAMShare>1.0</maxAMShare>
>         </queue>
>     </queue>
>     <defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>
>     <queuePlacementPolicy>
>         <rule name="specified" create="true"/>
>         <rule name="nestedUserQueue" create="true">
>             <rule name="default" create="true" queue="users"/>
>         </rule>
>         <rule name="default"/>
>     </queuePlacementPolicy>
> </allocations>
> {code}
> where {{maxAMShare}} is 1.0f, meaning it is possible to allocate 100% of the queue's resources to Application Masters. Notice above that root.users is a parent queue, yet it still accepts {{maxAMShare}}. This is contrary to the documentation and is very misleading, because child queues like root.users.<user> do not inherit this setting at all and still use the default of 0.5 instead of 1.0; see the attached screenshot as an example.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
