You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@yunikorn.apache.org by GitBox <gi...@apache.org> on 2020/10/02 18:00:54 UTC

[GitHub] [incubator-yunikorn-core] bmv126 opened a new pull request #213: [YUNIKORN-434] Fix for scheduler crash when the queueName in pod spec doesn't contain DOT

bmv126 opened a new pull request #213:
URL: https://github.com/apache/incubator-yunikorn-core/pull/213


   Scheduler pod crashes when queue name in pod spec does not contain DOT.
   
   It fails with below error:
   
   `2020-10-02T13:49:59.871Z        DEBUG   scheduler/scheduler.go:230      enqueued event  {"eventType": "*schedulerevent.SchedulerApplicationsUpdateEvent", "event": {"AddedApplications":[{"ApplicationID":"application-sleep-test","Partition":"[mycluster]default","QueueName":"test","SubmissionTime":1601646599870976327}],"RemovedApplications":null}, "currentQueueSize": 0}
   panic: runtime error: slice bounds out of range
   
   goroutine 69 [running]:
   github.com/apache/incubator-yunikorn-core/pkg/scheduler.(*partitionSchedulingContext).createSchedulingQueue(0xc0001fe7e0, 0xc000043cfc, 0x4, 0xc000593a58, 0x7, 0xc0038e42f0, 0x1, 0x1, 0x0, 0xfffffff1886e0900)
           /home/travis/gopath/pkg/mod/github.com/apache/incubator-yunikorn-core@v0.0.0-20200909001345-c2699e725c8a/pkg/scheduler/scheduling_partition.go:237 +0x8e9
   github.com/apache/incubator-yunikorn-core/pkg/scheduler.(*partitionSchedulingContext).addSchedulingApplication(0xc0001fe7e0, 0xc0036bec60, 0x0, 0x0)
           /home/travis/gopath/pkg/mod/github.com/apache/incubator-yunikorn-core@v0.0.0-20200909001345-c2699e725c8a/pkg/scheduler/scheduling_partition.go:124 +0x670
   github.com/apache/incubator-yunikorn-core/pkg/scheduler.(*ClusterSchedulingContext).addSchedulingApplication(0xc0004508a0, 0xc0036bec60, 0x0, 0x0)
           /home/travis/gopath/pkg/mod/github.com/apache/incubator-yunikorn-core@v0.0.0-20200909001345-c2699e725c8a/pkg/scheduler/scheduling_context.go:114 +0x105
   github.com/apache/incubator-yunikorn-core/pkg/scheduler.(*Scheduler).addNewApplication(0xc00022a000, 0xc0000a1100, 0xc003936381, 0x9)
           /home/travis/gopath/pkg/mod/github.com/apache/incubator-yunikorn-core@v0.0.0-20200909001345-c2699e725c8a/pkg/scheduler/scheduler.go:209 +0x11b
   github.com/apache/incubator-yunikorn-core/pkg/scheduler.(*Scheduler).processApplicationUpdateEvent(0xc00022a000, 0xc0063c22a0)
           /home/travis/gopath/pkg/mod/github.com/apache/incubator-yunikorn-core@v0.0.0-20200909001345-c2699e725c8a/pkg/scheduler/scheduler.go:447 +0x76e
   github.com/apache/incubator-yunikorn-core/pkg/scheduler.(*Scheduler).handleSchedulerEvent(0xc00022a000)
           /home/travis/gopath/pkg/mod/github.com/apache/incubator-yunikorn-core@v0.0.0-20200909001345-c2699e725c8a/pkg/scheduler/scheduler.go:596 +0x3e5
   created by github.com/apache/incubator-yunikorn-core/pkg/scheduler.(*Scheduler).StartService
           /home/travis/gopath/pkg/mod/github.com/apache/incubator-yunikorn-core@v0.0.0-20200909001345-c2699e725c8a/pkg/scheduler/scheduler.go:67 +0x6d`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] yangwwei commented on pull request #213: [YUNIKORN-434] Fix for scheduler crash when the queueName in pod spec doesn't contain DOT

Posted by GitBox <gi...@apache.org>.
yangwwei commented on pull request #213:
URL: https://github.com/apache/incubator-yunikorn-core/pull/213#issuecomment-704026068


   Thanks @wilfred-s . I guess for this issue, we can apply the workaround you mentioned. The ultimate fix will be adding the default placement rule in YUNIKORN-19, is this correct?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] yangwwei commented on pull request #213: [YUNIKORN-434] Fix for scheduler crash when the queueName in pod spec doesn't contain DOT

Posted by GitBox <gi...@apache.org>.
yangwwei commented on pull request #213:
URL: https://github.com/apache/incubator-yunikorn-core/pull/213#issuecomment-705990361


   hi @bmv126 do you want to update the PR according to @wilfred-s 's [this comment](https://github.com/apache/incubator-yunikorn-core/pull/213#issuecomment-703963453)? The ultimate fix needs more change, we can do that later. Pls let me know, thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] yangwwei commented on pull request #213: [YUNIKORN-434] Fix for scheduler crash when the queueName in pod spec doesn't contain DOT

Posted by GitBox <gi...@apache.org>.
yangwwei commented on pull request #213:
URL: https://github.com/apache/incubator-yunikorn-core/pull/213#issuecomment-702972926


   Hi @bmv126 
   
   Thanks for raising this up. Good catch.
   Is the NPE happening at https://github.com/apache/incubator-yunikorn-core/blob/29d6c82323020773d75c37e2f81000115f01d43d/pkg/scheduler/scheduling_partition.go#L237?
   Can we do the check in `createSchedulingQueue()`?  We need to create some UT to cover this function as well, currently, it seems to be missing. Please take a look if we can add some UT `scheduling_partition_test.go`
   
   Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] yangwwei commented on pull request #213: [YUNIKORN-434] Fix for scheduler crash when the queueName in pod spec doesn't contain DOT

Posted by GitBox <gi...@apache.org>.
yangwwei commented on pull request #213:
URL: https://github.com/apache/incubator-yunikorn-core/pull/213#issuecomment-703947249


   Add a default placement rule sounds like a good idea, that gives us a more consistent way of handling app submission. This will be a bigger change.
   
   But even we are using that approach, I guess it is still needed to fix the NPE in `createSchedulingQueue ()`, correct? I mean this could happen even in the case while calling by the placement rule logic. @wilfred-s  could you please take another look and let us know if this is needed?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on pull request #213: [YUNIKORN-434] Fix for scheduler crash when the queueName in pod spec doesn't contain DOT

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on pull request #213:
URL: https://github.com/apache/incubator-yunikorn-core/pull/213#issuecomment-703376485


   We should not crash but the analysis is I think not correct. The dot does not have to be part of the queue name in the pod spec. In this case without placement rules we should not be calling the `createSchedulingQueue()` at all.
   
   The scheduler does not have a placement rule specified based on the jira. That means there is no placement rule executed when the application is submitted. The queue in the pod spec must comply to one specific rule: it must exist and be fully qualified. If that is not the case we should reject the submission. We should never go and create a queue with the placement rules unless we make that behavioural change and document it.
   
   If we want to become more flexible we could run it by default through a placement rule that mimics that kind of behaviour but still does all the checks needed. That is why I filed [YUNIKORN-19](https://issues.apache.org/jira/browse/YUNIKORN-19). If no rules are given in the config run it through the `provided` rule with a create flag set to `false`. That will handle the queue not being qualified and check that the queue exists or not.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] yangwwei commented on pull request #213: [YUNIKORN-434] Fix for scheduler crash when the queueName in pod spec doesn't contain DOT

Posted by GitBox <gi...@apache.org>.
yangwwei commented on pull request #213:
URL: https://github.com/apache/incubator-yunikorn-core/pull/213#issuecomment-705990361


   hi @bmv126 do you want to update the PR according to @wilfred-s 's [this comment](https://github.com/apache/incubator-yunikorn-core/pull/213#issuecomment-703963453)? The ultimate fix needs more change, we can do that later. Pls let me know, thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s closed pull request #213: [YUNIKORN-434] Add check for placement rule not defined and queue not found

Posted by GitBox <gi...@apache.org>.
wilfred-s closed pull request #213:
URL: https://github.com/apache/incubator-yunikorn-core/pull/213


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on pull request #213: [YUNIKORN-434] Fix for scheduler crash when the queueName in pod spec doesn't contain DOT

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on pull request #213:
URL: https://github.com/apache/incubator-yunikorn-core/pull/213#issuecomment-704037200


   yes that is correct.
   I am happy with the code change I proposed above with a comment pointing to YUNIKORN-19 added.
   If anyone wants to do this without waiting I would recommend the non code config change. That is good for all versions and mimics the behaviour we want as a result of YUNIKORN-19


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on pull request #213: [YUNIKORN-434] Add check for placement rule not defined and queue not found

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on pull request #213:
URL: https://github.com/apache/incubator-yunikorn-core/pull/213#issuecomment-736933657


   This code was ported and merged as part of YUNIKORN-317.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] TravisBuddy commented on pull request #213: [YUNIKORN-434] Fix for scheduler crash when the queueName in pod spec doesn't contain DOT

Posted by GitBox <gi...@apache.org>.
TravisBuddy commented on pull request #213:
URL: https://github.com/apache/incubator-yunikorn-core/pull/213#issuecomment-702914566


   ## Travis tests have failed
   
   Hey @bmv126,
   Please read the following log in order to understand the failure reason.
   It'll be awesome if you fix what's wrong and commit the changes.
   
   
   ### 1st Build
   
   <a href="https:&#x2F;&#x2F;travis-ci.org&#x2F;apache&#x2F;incubator-yunikorn-core&#x2F;jobs&#x2F;732340183">View build log</a>
   
   
   <details>
     <summary>
       <strong>
        curl -sSfL https:&#x2F;&#x2F;raw.githubusercontent.com&#x2F;golangci&#x2F;golangci-lint&#x2F;master&#x2F;install.sh | sh -s -- -b $(go env GOPATH)&#x2F;bin v1.22.2
       </strong>
     </summary>
   
   ```
   golangci/golangci-lint info checking GitHub for tag 'v1.22.2'
   golangci/golangci-lint info found version: 1.22.2 for v1.22.2/linux/amd64
   golangci/golangci-lint info installed /home/travis/gopath/bin/golangci-lint
   ```
   
   </details>
   
   
   <details>
     <summary>
       <strong>
        make lint
       </strong>
     </summary>
   
   ```
   running golangci-lint
   pkg/scheduler/scheduling_partition.go:117: File is not `gofmt`-ed with `-s` (gofmt)
   	   return fmt.Errorf("failed to place app in queue %s for application %s as QueueName does not contain %s", queueName, appID, cache.DOT)
   Makefile:45: recipe for target 'lint' failed
   make: *** [lint] Error 1
   ```
   
   </details>
   
   
   ###### TravisBuddy Request Identifier: 75b1c010-04e4-11eb-a4bc-1b94c014fd22
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] bmv126 commented on pull request #213: [YUNIKORN-434] Fix for scheduler crash when the queueName in pod spec doesn't contain DOT

Posted by GitBox <gi...@apache.org>.
bmv126 commented on pull request #213:
URL: https://github.com/apache/incubator-yunikorn-core/pull/213#issuecomment-703088324


   @yangwwei 
   Yes the exception is happening at above mentioned location.
   I will add tests for this scenario.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on pull request #213: [YUNIKORN-434] Fix for scheduler crash when the queueName in pod spec doesn't contain DOT

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on pull request #213:
URL: https://github.com/apache/incubator-yunikorn-core/pull/213#issuecomment-703963453


   > But even we are using that approach, I guess it is still needed to fix the NPE in createSchedulingQueue (), correct? I mean this could happen even in the case while calling by the placement rule logic. @wilfred-s could you please take another look and let us know if this is needed?
   
   No, when the rule is in we are good as the rule makes sure that the name is fully qualified and will add the missing `root.`
   The problem is not the fact that the `.` is missing it is the fact that the name is not fully qualified. When we call out to create a queue it must be fully qualified. The rules and placement manager guarantee that it is done. It is guaranteed as it is needed for parent rules and the create flag checks run by the rules.
   
   The other point to make is that the application should be rejected if the queue does not exist and no placement rules are configured. We should never try to create a queue without the placement rules in place. Allowing random queues to be created on submit could easily work around quotas etc and limits what an administrators can do to manage the cluster.
   As a workaround for this case if the queue does not exist and we do not have rules in place we should exit placement and reject (after line 116 in the original file):
   ```
   if schedulingQueue == nil && !psc.placementManager.IsInitialised() {
       return fmt.Errorf("application cannot be submitted to non existing queue %s for application %s", 
           schedulingApp.ApplicationInfo.QueueName, appID)
   }
   ```
   A non code change would be to add this to the config:
   ```
      placementrules:
       - name: provided
         create: false
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on pull request #213: [YUNIKORN-434] Fix for scheduler crash when the queueName in pod spec doesn't contain DOT

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on pull request #213:
URL: https://github.com/apache/incubator-yunikorn-core/pull/213#issuecomment-703966972


   One other major thing why the change proposed is not correct:
   After the proposed change if I submit an application with a queue set to `myroot.wilfred` then that would work from the create call (i.e. it will not crash) However it will not work from a submit perspective. The parent queue `myroot` that is set does not exist and the submit will still fail. I would still need to make sure that the name provided starts with `root.` to get the application submitted.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org