You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@yunikorn.apache.org by GitBox <gi...@apache.org> on 2021/10/23 07:47:57 UTC

[GitHub] [incubator-yunikorn-core] manirajv06 opened a new pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

manirajv06 opened a new pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332


   ### What is this PR for?
   
   Core side changes of YUNIKORN-337
   
   ### What type of PR is it?
   * [ ] - Improvement
   
   ### Todos
   * [ ] - Task
   
   ### What is the Jira issue?
   https://issues.apache.org/jira/browse/YUNIKORN-905
   
   ### How should this be tested?
   
   ### Screenshots (if appropriate)
   
   ### Questions:
   * [ ] - The licenses files need update.
   * [ ] - There is breaking changes for older versions.
   * [ ] - It needs documentation.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] manirajv06 commented on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
manirajv06 commented on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-961670398






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] codecov[bot] commented on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
codecov[bot] commented on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-961668085


   # [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#332](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (3bb0bde) into [master](https://codecov.io/gh/apache/incubator-yunikorn-core/commit/5a1c19e280f66d6cb4fc206a675116260d6f2ff1?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (5a1c19e) will **increase** coverage by `3.98%`.
   > The diff coverage is `67.40%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332/graphs/tree.svg?width=650&height=150&src=pr&token=SB9NrIi3Hy&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #332      +/-   ##
   ==========================================
   + Coverage   63.46%   67.44%   +3.98%     
   ==========================================
     Files          60       63       +3     
     Lines        5220     6086     +866     
   ==========================================
   + Hits         3313     4105     +792     
   - Misses       1747     1764      +17     
   - Partials      160      217      +57     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [pkg/metrics/init.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL21ldHJpY3MvaW5pdC5nbw==) | `62.96% <ø> (ø)` | |
   | [pkg/scheduler/drf\_preemption\_policy.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci9kcmZfcHJlZW1wdGlvbl9wb2xpY3kuZ28=) | `0.00% <0.00%> (ø)` | |
   | [pkg/scheduler/nodes\_usage\_monitor.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci9ub2Rlc191c2FnZV9tb25pdG9yLmdv) | `0.00% <0.00%> (ø)` | |
   | [pkg/scheduler/objects/allocation.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FsbG9jYXRpb24uZ28=) | `100.00% <ø> (ø)` | |
   | [pkg/scheduler/objects/node\_iterator.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL25vZGVfaXRlcmF0b3IuZ28=) | `100.00% <ø> (ø)` | |
   | [pkg/scheduler/objects/sorters.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL3NvcnRlcnMuZ28=) | `100.00% <ø> (ø)` | |
   | [pkg/scheduler/scheduler.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci9zY2hlZHVsZXIuZ28=) | `0.00% <0.00%> (ø)` | |
   | [pkg/webservice/webservice.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3dlYnNlcnZpY2Uvd2Vic2VydmljZS5nbw==) | `32.25% <ø> (+19.35%)` | :arrow_up: |
   | [pkg/scheduler/context.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci9jb250ZXh0Lmdv) | `16.55% <27.81%> (+10.28%)` | :arrow_up: |
   | [pkg/metrics/metrics\_collector.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL21ldHJpY3MvbWV0cmljc19jb2xsZWN0b3IuZ28=) | `64.51% <45.00%> (-9.40%)` | :arrow_down: |
   | ... and [33 more](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [66eaecc...3bb0bde](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/332?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] manirajv06 commented on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
manirajv06 commented on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-961670398


   Fixed unit test failures. Most of the Node registration failures occured because app doesn't exist around node reg process happens. Ensure app existence by calling UpdateApplication before the UpdateNode fix these problems.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-960371951


   Can you please have a look at the unit test failures, these might be just test implementation issues but the file has changed (all failures in recovery_test.go):
   --- FAIL: TestFailReplacePlaceholder (0.08s)
       partition_test.go:1706: assertion failed: node-1 (alloc.NodeID string) != node-2 (nodeID2 string): should be allocated on node-2
   --- FAIL: TestSchedulerRecovery (2.10s)
       mock_rm_callback_test.go:164: assertion failed: error is not nil: timeout waiting for condition: Failed to wait for node state to become accepted: node-1:1234, called from: TestSchedulerRecovery in recovery_test.go:283
   --- FAIL: TestSchedulerRecovery2Allocations (1.37s)
       mock_rm_callback_test.go:164: assertion failed: error is not nil: timeout waiting for condition: Failed to wait for node state to become accepted: node-1:1234, called from: TestSchedulerRecovery2Allocations in recovery_test.go:454
   --- FAIL: TestSchedulerRecoveryWithoutAppInfo (1.02s)
       mock_rm_callback_test.go:164: assertion failed: error is not nil: timeout waiting for condition: Failed to wait for node state to become accepted: node-1:1234, called from: TestSchedulerRecoveryWithoutAppInfo in recovery_test.go:579


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s edited a comment on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
wilfred-s edited a comment on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-962774771


   Based on the unit test failures we must make sure that the order in the shim is correct. First recover apps then the nodes.
   
   Looking at the change we might have had an issue in the RMProxy for a long time. I do think that we need to add a retry in the node update when we recover the node. Even in the previous implementation there was no guarantee that all the application were added before a node was recovered. The tests in the unit tests used the order processing dependency to make sure it worked. 
   
   There was _never_ an order requirement on the events send by a shim or a use of complex updates events to support this ordering by the shim. An event to recover a node could be a separate UpdateRequest from the applications that should be recovered. That means we relied on the go routine ordering to hopefully do things correctly: i.e. events send by the shim to create new apps would be processed before node recovery started.
   
   That is a dangerous assumption: filing a follow up jira.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] manirajv06 edited a comment on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
manirajv06 edited a comment on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-950113848


   Notes (Below info applies for other related PR's as well):
   
   Made changes only for Section 1 (SECTION 1: Splitting Messages):
   
   UpdateRequest has been broken down into 
   1. UpdateAllocation
   2. UpdateApplication
   3. UpdateNode
   
   To do:
   
   1. Need to see if we can merge ReloadConfiguration and UpdateConfiguration.
   2. Have to take care of "SECTION 2: Breaking the Shim build dependency on Core by moving the plugins to Scheduler Interface" completely.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on a change in pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on a change in pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#discussion_r741613094



##########
File path: pkg/plugins/types.go
##########
@@ -21,49 +21,11 @@ package plugins
 import (
 	"sync"
 
-	"github.com/apache/incubator-yunikorn-scheduler-interface/lib/go/si"
+	"github.com/apache/incubator-yunikorn-scheduler-interface/lib/go/api"
 )
 
 type SchedulerPlugins struct {
-	predicatesPlugin       PredicatesPlugin
-	reconcilePlugin        ReconcilePlugin
-	eventPlugin            EventPlugin
-	schedulingStateUpdater ContainerSchedulingStateUpdater
-	configPlugin           ConfigurationPlugin
+	ResourceManagerCallbackPlugin	api.ResourceManagerCallback
 
 	sync.RWMutex
-}
-
-// RM side implements this API when it can provide plugin for predicates.
-type PredicatesPlugin interface {
-	// Run a certain set of predicate functions to determine if a proposed allocation
-	// can be allocated onto a node.
-	Predicates(args *si.PredicatesArgs) error
-}
-
-type ReconcilePlugin interface {
-	// RM side implements this API when it can provide plugin for reconciling
-	// Re-sync scheduler cache can sync some in-cache (yunikorn-core side) state changes
-	// to scheduler cache (shim-side), such as assumed allocations.
-	ReSyncSchedulerCache(args *si.ReSyncSchedulerCacheArgs) error
-}
-
-type EventPlugin interface {
-	// This plugin is responsible for transmitting events to the shim side.
-	// Events can be further exposed from the shim.
-	SendEvent(events []*si.EventRecord)
-}
-
-// Scheduler core can update container scheduling state to the RM,
-// the shim side can determine what to do incorporate with the scheduling state
-type ContainerSchedulingStateUpdater interface {
-	// update container scheduling state to the shim side
-	// this might be called even the container scheduling state is unchanged
-	// the shim side cannot assume to only receive updates on state changes
-	// the shim side implementation must be thread safe
-	Update(request *si.UpdateContainerSchedulingStateRequest)
-}
-
-type ConfigurationPlugin interface {
-	UpdateConfiguration(args *si.UpdateConfigurationRequest) *si.UpdateConfigurationResponse
-}
+}

Review comment:
       nit: missing newline

##########
File path: pkg/scheduler/context.go
##########
@@ -201,31 +201,106 @@ func (cc *ClusterContext) processRMConfigUpdateEvent(event *rmevent.RMConfigUpda
 	configs.ConfigContext.Set(cc.policyGroup, conf)
 }
 
-// Main update processing: the RM passes a large multi part update which needs to be unravelled.
-// Order of following operations is fixed, don't change unless carefully thought through.
-// 1) applications
-// 2) allocations on existing applications
-// 3) nodes
-// Updating allocations on existing applications requires the application to exist.
-// Node updates include recovered nodes which are linked to applications that must exist.
-func (cc *ClusterContext) processRMUpdateEvent(event *rmevent.RMUpdateRequestEvent) {
+func (cc *ClusterContext) handleRMUpdateNodeEvent(event *rmevent.RMUpdateNodeEvent) {
 	request := event.Request
-	// 1) Add / remove app requested by RM.
-	cc.processApplications(request)
-	// 2) Add new request, release allocation, cancel request
-	cc.processAllocations(request)
-	// 3) Add / remove / update Nodes
 	cc.processNodes(request)
 }
 
-func (cc *ClusterContext) processNodes(request *si.UpdateRequest) {
-	// 3) Add / remove / update Nodes
-	// Process add node
-	if len(request.NewSchedulableNodes) > 0 {
-		cc.addNodes(request)
-	}
-	if len(request.UpdatedNodes) > 0 {
-		cc.updateNodes(request)
+func (cc *ClusterContext) processNodes(request *si.NodeRequest) {
+	// 1) Add / remove / update Nodes
+	if len(request.GetNodes()) > 0 {
+		acceptedNodes := make([]*si.AcceptedNode, 0)
+		rejectedNodes := make([]*si.RejectedNode, 0)
+		for _, nodeInfo := range request.GetNodes() {
+			switch nodeInfo.Action {
+			case si.NodeInfo_CREATE:
+				sn := objects.NewNode(nodeInfo)

Review comment:
       Can we move this create into its own method? Similar to the way `updateNode` is called for all update cases? Makes the code more readable:
   ```
   // addNode adds a node to the cluster returns error if the action fails or nil on success
   func (cc *ClusterContext) addNode(nodeInfo *si.NodeInfo) error {
   }
   ```
   If the response is a an error the error message can be used in the reject state otherwise it is an accepted node.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] manirajv06 commented on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
manirajv06 commented on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-950113848


   Notes:
   
   Made changes only for Section 1 (SECTION 1: Splitting Messages):
   
   UpdateRequest has been broken down into 
   1. UpdateAllocation
   2. UpdateApplication
   3. UpdateNode
   
   To do:
   
   1. Need to see if we can merge ReloadConfiguration and UpdateConfiguration.
   2. Have to take care of "SECTION 2: Breaking the Shim build dependency on Core by moving the plugins to Scheduler Interface" completely.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] manirajv06 commented on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
manirajv06 commented on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-958647563


   > 1. Need to discuss and see how we can break the dependency completely as there are some places in shim where code refers core directly.
   
   Filed https://issues.apache.org/jira/browse/YUNIKORN-930 to handle this as part of 1.1.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] manirajv06 commented on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
manirajv06 commented on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-958647563






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] manirajv06 commented on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
manirajv06 commented on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-958647563


   > 1. Need to discuss and see how we can break the dependency completely as there are some places in shim where code refers core directly.
   
   Filed https://issues.apache.org/jira/browse/YUNIKORN-930 to handle this as part of 1.1.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-963715464


   Sorry for the delay in committing, all done now


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s closed pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
wilfred-s closed pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] manirajv06 edited a comment on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
manirajv06 edited a comment on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-950113848


   Since this is an WIP pr, please refer the below notes(Below info applies for other related PR's as well):
   
   Made changes only for Section 1 (SECTION 1: Splitting Messages):
   
   UpdateRequest has been broken down into 
   1. UpdateAllocation
   2. UpdateApplication
   3. UpdateNode
   
   To do:
   
   1. Need to see if we can merge ReloadConfiguration and UpdateConfiguration.
   2. Have to take care of "SECTION 2: Breaking the Shim build dependency on Core by moving the plugins to Scheduler Interface" completely.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] manirajv06 commented on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
manirajv06 commented on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-951009241


   Since this is an WIP pr, please refer the below notes(Below info applies for other related PR's as well) for this commit:
   
   Made changes only for "SECTION 2: Breaking the Shim build dependency on Core by moving the plugins to Scheduler Interface"
   
   * Moved all plugins from core to appropriate place in SI under single common existing interface itself rather than having different interface as I though this model is clean and good. Let me know if you think otherwise.
   * As opposed to introducing new events (as documented) to go through via rmproxy, made the changes to call those plugin directly from the core, node etc as I thought add more new events is unnecessary.
   
   To do:
   
   1. Need to discuss and see how we can break the dependency completely as there are some places in shim where code refers core directly.
   2. Need to see if we can merge ReloadConfiguration and UpdateConfiguration.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on a change in pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on a change in pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#discussion_r741613094



##########
File path: pkg/plugins/types.go
##########
@@ -21,49 +21,11 @@ package plugins
 import (
 	"sync"
 
-	"github.com/apache/incubator-yunikorn-scheduler-interface/lib/go/si"
+	"github.com/apache/incubator-yunikorn-scheduler-interface/lib/go/api"
 )
 
 type SchedulerPlugins struct {
-	predicatesPlugin       PredicatesPlugin
-	reconcilePlugin        ReconcilePlugin
-	eventPlugin            EventPlugin
-	schedulingStateUpdater ContainerSchedulingStateUpdater
-	configPlugin           ConfigurationPlugin
+	ResourceManagerCallbackPlugin	api.ResourceManagerCallback
 
 	sync.RWMutex
-}
-
-// RM side implements this API when it can provide plugin for predicates.
-type PredicatesPlugin interface {
-	// Run a certain set of predicate functions to determine if a proposed allocation
-	// can be allocated onto a node.
-	Predicates(args *si.PredicatesArgs) error
-}
-
-type ReconcilePlugin interface {
-	// RM side implements this API when it can provide plugin for reconciling
-	// Re-sync scheduler cache can sync some in-cache (yunikorn-core side) state changes
-	// to scheduler cache (shim-side), such as assumed allocations.
-	ReSyncSchedulerCache(args *si.ReSyncSchedulerCacheArgs) error
-}
-
-type EventPlugin interface {
-	// This plugin is responsible for transmitting events to the shim side.
-	// Events can be further exposed from the shim.
-	SendEvent(events []*si.EventRecord)
-}
-
-// Scheduler core can update container scheduling state to the RM,
-// the shim side can determine what to do incorporate with the scheduling state
-type ContainerSchedulingStateUpdater interface {
-	// update container scheduling state to the shim side
-	// this might be called even the container scheduling state is unchanged
-	// the shim side cannot assume to only receive updates on state changes
-	// the shim side implementation must be thread safe
-	Update(request *si.UpdateContainerSchedulingStateRequest)
-}
-
-type ConfigurationPlugin interface {
-	UpdateConfiguration(args *si.UpdateConfigurationRequest) *si.UpdateConfigurationResponse
-}
+}

Review comment:
       nit: missing newline

##########
File path: pkg/scheduler/context.go
##########
@@ -201,31 +201,106 @@ func (cc *ClusterContext) processRMConfigUpdateEvent(event *rmevent.RMConfigUpda
 	configs.ConfigContext.Set(cc.policyGroup, conf)
 }
 
-// Main update processing: the RM passes a large multi part update which needs to be unravelled.
-// Order of following operations is fixed, don't change unless carefully thought through.
-// 1) applications
-// 2) allocations on existing applications
-// 3) nodes
-// Updating allocations on existing applications requires the application to exist.
-// Node updates include recovered nodes which are linked to applications that must exist.
-func (cc *ClusterContext) processRMUpdateEvent(event *rmevent.RMUpdateRequestEvent) {
+func (cc *ClusterContext) handleRMUpdateNodeEvent(event *rmevent.RMUpdateNodeEvent) {
 	request := event.Request
-	// 1) Add / remove app requested by RM.
-	cc.processApplications(request)
-	// 2) Add new request, release allocation, cancel request
-	cc.processAllocations(request)
-	// 3) Add / remove / update Nodes
 	cc.processNodes(request)
 }
 
-func (cc *ClusterContext) processNodes(request *si.UpdateRequest) {
-	// 3) Add / remove / update Nodes
-	// Process add node
-	if len(request.NewSchedulableNodes) > 0 {
-		cc.addNodes(request)
-	}
-	if len(request.UpdatedNodes) > 0 {
-		cc.updateNodes(request)
+func (cc *ClusterContext) processNodes(request *si.NodeRequest) {
+	// 1) Add / remove / update Nodes
+	if len(request.GetNodes()) > 0 {
+		acceptedNodes := make([]*si.AcceptedNode, 0)
+		rejectedNodes := make([]*si.RejectedNode, 0)
+		for _, nodeInfo := range request.GetNodes() {
+			switch nodeInfo.Action {
+			case si.NodeInfo_CREATE:
+				sn := objects.NewNode(nodeInfo)

Review comment:
       Can we move this create into its own method? Similar to the way `updateNode` is called for all update cases? Makes the code more readable:
   ```
   // addNode adds a node to the cluster returns error if the action fails or nil on success
   func (cc *ClusterContext) addNode(nodeInfo *si.NodeInfo) error {
   }
   ```
   If the response is a an error the error message can be used in the reject state otherwise it is an accepted node.

##########
File path: pkg/plugins/types.go
##########
@@ -21,49 +21,11 @@ package plugins
 import (
 	"sync"
 
-	"github.com/apache/incubator-yunikorn-scheduler-interface/lib/go/si"
+	"github.com/apache/incubator-yunikorn-scheduler-interface/lib/go/api"
 )
 
 type SchedulerPlugins struct {
-	predicatesPlugin       PredicatesPlugin
-	reconcilePlugin        ReconcilePlugin
-	eventPlugin            EventPlugin
-	schedulingStateUpdater ContainerSchedulingStateUpdater
-	configPlugin           ConfigurationPlugin
+	ResourceManagerCallbackPlugin	api.ResourceManagerCallback
 
 	sync.RWMutex
-}
-
-// RM side implements this API when it can provide plugin for predicates.
-type PredicatesPlugin interface {
-	// Run a certain set of predicate functions to determine if a proposed allocation
-	// can be allocated onto a node.
-	Predicates(args *si.PredicatesArgs) error
-}
-
-type ReconcilePlugin interface {
-	// RM side implements this API when it can provide plugin for reconciling
-	// Re-sync scheduler cache can sync some in-cache (yunikorn-core side) state changes
-	// to scheduler cache (shim-side), such as assumed allocations.
-	ReSyncSchedulerCache(args *si.ReSyncSchedulerCacheArgs) error
-}
-
-type EventPlugin interface {
-	// This plugin is responsible for transmitting events to the shim side.
-	// Events can be further exposed from the shim.
-	SendEvent(events []*si.EventRecord)
-}
-
-// Scheduler core can update container scheduling state to the RM,
-// the shim side can determine what to do incorporate with the scheduling state
-type ContainerSchedulingStateUpdater interface {
-	// update container scheduling state to the shim side
-	// this might be called even the container scheduling state is unchanged
-	// the shim side cannot assume to only receive updates on state changes
-	// the shim side implementation must be thread safe
-	Update(request *si.UpdateContainerSchedulingStateRequest)
-}
-
-type ConfigurationPlugin interface {
-	UpdateConfiguration(args *si.UpdateConfigurationRequest) *si.UpdateConfigurationResponse
-}
+}

Review comment:
       nit: missing newline

##########
File path: pkg/scheduler/context.go
##########
@@ -201,31 +201,106 @@ func (cc *ClusterContext) processRMConfigUpdateEvent(event *rmevent.RMConfigUpda
 	configs.ConfigContext.Set(cc.policyGroup, conf)
 }
 
-// Main update processing: the RM passes a large multi part update which needs to be unravelled.
-// Order of following operations is fixed, don't change unless carefully thought through.
-// 1) applications
-// 2) allocations on existing applications
-// 3) nodes
-// Updating allocations on existing applications requires the application to exist.
-// Node updates include recovered nodes which are linked to applications that must exist.
-func (cc *ClusterContext) processRMUpdateEvent(event *rmevent.RMUpdateRequestEvent) {
+func (cc *ClusterContext) handleRMUpdateNodeEvent(event *rmevent.RMUpdateNodeEvent) {
 	request := event.Request
-	// 1) Add / remove app requested by RM.
-	cc.processApplications(request)
-	// 2) Add new request, release allocation, cancel request
-	cc.processAllocations(request)
-	// 3) Add / remove / update Nodes
 	cc.processNodes(request)
 }
 
-func (cc *ClusterContext) processNodes(request *si.UpdateRequest) {
-	// 3) Add / remove / update Nodes
-	// Process add node
-	if len(request.NewSchedulableNodes) > 0 {
-		cc.addNodes(request)
-	}
-	if len(request.UpdatedNodes) > 0 {
-		cc.updateNodes(request)
+func (cc *ClusterContext) processNodes(request *si.NodeRequest) {
+	// 1) Add / remove / update Nodes
+	if len(request.GetNodes()) > 0 {
+		acceptedNodes := make([]*si.AcceptedNode, 0)
+		rejectedNodes := make([]*si.RejectedNode, 0)
+		for _, nodeInfo := range request.GetNodes() {
+			switch nodeInfo.Action {
+			case si.NodeInfo_CREATE:
+				sn := objects.NewNode(nodeInfo)

Review comment:
       Can we move this create into its own method? Similar to the way `updateNode` is called for all update cases? Makes the code more readable:
   ```
   // addNode adds a node to the cluster returns error if the action fails or nil on success
   func (cc *ClusterContext) addNode(nodeInfo *si.NodeInfo) error {
   }
   ```
   If the response is a an error the error message can be used in the reject state otherwise it is an accepted node.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on a change in pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on a change in pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#discussion_r741613094



##########
File path: pkg/plugins/types.go
##########
@@ -21,49 +21,11 @@ package plugins
 import (
 	"sync"
 
-	"github.com/apache/incubator-yunikorn-scheduler-interface/lib/go/si"
+	"github.com/apache/incubator-yunikorn-scheduler-interface/lib/go/api"
 )
 
 type SchedulerPlugins struct {
-	predicatesPlugin       PredicatesPlugin
-	reconcilePlugin        ReconcilePlugin
-	eventPlugin            EventPlugin
-	schedulingStateUpdater ContainerSchedulingStateUpdater
-	configPlugin           ConfigurationPlugin
+	ResourceManagerCallbackPlugin	api.ResourceManagerCallback
 
 	sync.RWMutex
-}
-
-// RM side implements this API when it can provide plugin for predicates.
-type PredicatesPlugin interface {
-	// Run a certain set of predicate functions to determine if a proposed allocation
-	// can be allocated onto a node.
-	Predicates(args *si.PredicatesArgs) error
-}
-
-type ReconcilePlugin interface {
-	// RM side implements this API when it can provide plugin for reconciling
-	// Re-sync scheduler cache can sync some in-cache (yunikorn-core side) state changes
-	// to scheduler cache (shim-side), such as assumed allocations.
-	ReSyncSchedulerCache(args *si.ReSyncSchedulerCacheArgs) error
-}
-
-type EventPlugin interface {
-	// This plugin is responsible for transmitting events to the shim side.
-	// Events can be further exposed from the shim.
-	SendEvent(events []*si.EventRecord)
-}
-
-// Scheduler core can update container scheduling state to the RM,
-// the shim side can determine what to do incorporate with the scheduling state
-type ContainerSchedulingStateUpdater interface {
-	// update container scheduling state to the shim side
-	// this might be called even the container scheduling state is unchanged
-	// the shim side cannot assume to only receive updates on state changes
-	// the shim side implementation must be thread safe
-	Update(request *si.UpdateContainerSchedulingStateRequest)
-}
-
-type ConfigurationPlugin interface {
-	UpdateConfiguration(args *si.UpdateConfigurationRequest) *si.UpdateConfigurationResponse
-}
+}

Review comment:
       nit: missing newline

##########
File path: pkg/scheduler/context.go
##########
@@ -201,31 +201,106 @@ func (cc *ClusterContext) processRMConfigUpdateEvent(event *rmevent.RMConfigUpda
 	configs.ConfigContext.Set(cc.policyGroup, conf)
 }
 
-// Main update processing: the RM passes a large multi part update which needs to be unravelled.
-// Order of following operations is fixed, don't change unless carefully thought through.
-// 1) applications
-// 2) allocations on existing applications
-// 3) nodes
-// Updating allocations on existing applications requires the application to exist.
-// Node updates include recovered nodes which are linked to applications that must exist.
-func (cc *ClusterContext) processRMUpdateEvent(event *rmevent.RMUpdateRequestEvent) {
+func (cc *ClusterContext) handleRMUpdateNodeEvent(event *rmevent.RMUpdateNodeEvent) {
 	request := event.Request
-	// 1) Add / remove app requested by RM.
-	cc.processApplications(request)
-	// 2) Add new request, release allocation, cancel request
-	cc.processAllocations(request)
-	// 3) Add / remove / update Nodes
 	cc.processNodes(request)
 }
 
-func (cc *ClusterContext) processNodes(request *si.UpdateRequest) {
-	// 3) Add / remove / update Nodes
-	// Process add node
-	if len(request.NewSchedulableNodes) > 0 {
-		cc.addNodes(request)
-	}
-	if len(request.UpdatedNodes) > 0 {
-		cc.updateNodes(request)
+func (cc *ClusterContext) processNodes(request *si.NodeRequest) {
+	// 1) Add / remove / update Nodes
+	if len(request.GetNodes()) > 0 {
+		acceptedNodes := make([]*si.AcceptedNode, 0)
+		rejectedNodes := make([]*si.RejectedNode, 0)
+		for _, nodeInfo := range request.GetNodes() {
+			switch nodeInfo.Action {
+			case si.NodeInfo_CREATE:
+				sn := objects.NewNode(nodeInfo)

Review comment:
       Can we move this create into its own method? Similar to the way `updateNode` is called for all update cases? Makes the code more readable:
   ```
   // addNode adds a node to the cluster returns error if the action fails or nil on success
   func (cc *ClusterContext) addNode(nodeInfo *si.NodeInfo) error {
   }
   ```
   If the response is a an error the error message can be used in the reject state otherwise it is an accepted node.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] manirajv06 edited a comment on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
manirajv06 edited a comment on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-961670398


   Fixed unit test failures. Most of the Node registration failures occured because app doesn't exist by the time node reg process reaches actual reg step. Ensuring app existence by calling UpdateApplication before the UpdateNode fix these problems.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] codecov[bot] commented on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
codecov[bot] commented on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-961668085






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] manirajv06 edited a comment on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
manirajv06 edited a comment on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-961670398


   Fixed unit test failures. Most of the Node registration failures occured because app doesn't exist by the time node reg process reaches actual reg step. Ensuring app existence by calling UpdateApplication before the UpdateNode fix these problems.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] manirajv06 edited a comment on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
manirajv06 edited a comment on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-961670398


   Fixed unit test failures. Most of the Node registration failures occured because app doesn't exist by the time node reg process reaches actual reg step. Ensuring app existence by calling UpdateApplication before the UpdateNode fix these problems.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] manirajv06 commented on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
manirajv06 commented on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-963275815


   Yes, we will need to take care of the ordering while recovering happens in shim side atleast for now to ensure this whole major changes works end to end well. However, we will need to address YUNIKORN-936 as a long term fix. Hope we are on the same page. If so, can we commit this so that I can take care of go.mod changes in shim?
   
   In fact, we skimmed through this particular ordering problem in one of our early offline discussions and decided to capture this in design doc as well. Since that it had come out very openly in test failures, stressed the same in this PR discussions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on pull request #332: YUNIKORN-905: Core side changes of YUNIKORN-337

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on pull request #332:
URL: https://github.com/apache/incubator-yunikorn-core/pull/332#issuecomment-962774771


   Based on the unit test failures we must make sure that the order in the shim is correct. First recover apps then the nodes.
   Looking at the change we might have had an issue in the RMProxy for a long time. I do think that we need to add a retry in the node update when we recover the node. Even in the previous implementation there was no guarantee that all the application were added before a node was recovered. The tests in the unit tests used the order processing dependency to make sure it worked. There was _never_ an order requirements on the events send by a shim. An event to recover a node could be a separate UpdateRequest from the applications that should be recovered. That means we relied on the go routine ordering to hopefully do things correctly: i.e. events send by the shim to create new apps would be processed before node recovery started.
   That is a dangerous assumption: filing a follow up jira.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org