Posted to reviews@yunikorn.apache.org by GitBox <gi...@apache.org> on 2021/01/06 14:07:01 UTC

[GitHub] [incubator-yunikorn-core] wilfred-s opened a new pull request #236: [YUNIKORN-476] placeholder allocation and swap

wilfred-s opened a new pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236


   Logic for the placeholder allocation and swap of the placeholder with a
   real allocation.
   Includes partial unit tests, no smoke tests.
   Requires the changes from YUNIKORN-497 from the SI


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on a change in pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on a change in pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#discussion_r556220812



##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself
+			var ok bool
+			var node *Node
+			for iterator.HasNext() {
+				node, ok = iterator.Next().(*Node)
+				if !ok {
+					log.Logger().Warn("Node iterator failed to return a node")
+					return nil
+				}
+				if node.NodeID == ph.NodeID {
+					break
+				}
+			}
+			// got the node run same checks as for reservation (all but fits)
+			// resource usage should not change anyway between placeholder and real one
+			if node != nil && node.preReserveConditions(request.AllocationKey) {

Review comment:
       Filed a follow-up JIRA for that; not sure where it should happen, as we can do that in either the core or the shim.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] yangwwei removed a comment on pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
yangwwei removed a comment on pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#issuecomment-756606256


   hi @TaoYang526, thanks for the draft PR. At first glance I wasn't able to fully understand some of the places, but after reading more from the previous design doc in https://issues.apache.org/jira/browse/YUNIKORN-1 they start to make sense. I quite like the interfaces defined in this PR; the confusing part is `queue_request_manager_plugin.go`. I feel we can somehow optimize this to make it simpler. I will look into more details and get back to you. Thanks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] yangwwei commented on a change in pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
yangwwei commented on a change in pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#discussion_r556162745



##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself
+			var ok bool
+			var node *Node
+			for iterator.HasNext() {
+				node, ok = iterator.Next().(*Node)
+				if !ok {
+					log.Logger().Warn("Node iterator failed to return a node")
+					return nil
+				}
+				if node.NodeID == ph.NodeID {
+					break
+				}
+			}
+			// got the node run same checks as for reservation (all but fits)
+			// resource usage should not change anyway between placeholder and real one
+			if node != nil && node.preReserveConditions(request.AllocationKey) {
+				alloc := NewAllocation(common.GetNewUUID(), node.NodeID, request)
+				// double link to make it easier to find
+				alloc.Releases = []*Allocation{ph}
+				ph.Releases = []*Allocation{alloc}
+				alloc.Result = Replaced
+				// mark placeholder as released
+				ph.released = true
+				_, err := sa.updateAskRepeatInternal(request, -1)
+				if err != nil {
+					log.Logger().Warn("ask repeat update failed unexpectedly",
+						zap.Error(err))
+				}
+				return alloc
+			}
+			iterator.Reset()
+		}
+	}
+	// we checked all placeholders and asks nothing worked as yet
+	// pick the first fit and try all nodes if that fails give up
+	if phFit != nil && reqFit != nil {
+		for iterator.HasNext() {
+			node, ok := iterator.Next().(*Node)
+			if !ok {
+				log.Logger().Warn("Node iterator failed to return a node")
+				return nil
+			}
+			if err := node.preAllocateCheck(reqFit.AllocatedResource, reservationKey(nil, sa, reqFit), false); err != nil {
+				continue
+			}
+			// skip the node if conditions can not be satisfied
+			if !node.preAllocateConditions(reqFit.AllocationKey) {
+				continue
+			}
+			// allocation worked: on a non placeholder node update result and return
+			alloc := NewAllocation(common.GetNewUUID(), node.NodeID, reqFit)
+			// double link to make it easier to find
+			alloc.Releases = []*Allocation{phFit}
+			phFit.Releases = []*Allocation{alloc}
+			alloc.Result = Replaced

Review comment:
       This doesn't seem to be correct.
   If the placeholder is on node `A` and the alloc is made on node `B`,
   here we send the shim a release for the placeholder, which is on `A`.
   Once the shim confirms that the placeholder has been released, `partition#removeAllocation()` replaces the placeholder allocation with the real allocation, which is on `B`.
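   For reference, a minimal sketch of the double link being discussed, trimmed from the quoted diff and annotated (illustrative only):
    ```go
    // the real allocation is created on the node that passed the checks (node B in
    // the example above), while the placeholder it replaces may sit on node A
    alloc := NewAllocation(common.GetNewUUID(), node.NodeID, reqFit)
    // double link so the core can find the placeholder from the real allocation
    // and vice versa once the shim confirms the release
    alloc.Releases = []*Allocation{phFit}
    phFit.Releases = []*Allocation{alloc}
    // mark the result so the partition performs a swap rather than a plain remove
    alloc.Result = Replaced
    // the placeholder is marked released; it is only removed after shim confirmation
    phFit.released = true
    ```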




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on a change in pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on a change in pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#discussion_r555502337



##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself
+			var ok bool
+			var node *Node
+			for iterator.HasNext() {
+				node, ok = iterator.Next().(*Node)
+				if !ok {
+					log.Logger().Warn("Node iterator failed to return a node")
+					return nil
+				}
+				if node.NodeID == ph.NodeID {
+					break
+				}
+			}
+			// got the node run same checks as for reservation (all but fits)
+			// resource usage should not change anyway between placeholder and real one
+			if node != nil && node.preReserveConditions(request.AllocationKey) {
+				alloc := NewAllocation(common.GetNewUUID(), node.NodeID, request)
+				// double link to make it easier to find
+				alloc.Releases = []*Allocation{ph}
+				ph.Releases = []*Allocation{alloc}
+				alloc.Result = Replaced
+				// mark placeholder as released
+				ph.released = true
+				_, err := sa.updateAskRepeatInternal(request, -1)
+				if err != nil {
+					log.Logger().Warn("ask repeat update failed unexpectedly",
+						zap.Error(err))
+				}
+				return alloc
+			}
+			iterator.Reset()
+		}
+	}
+	// we checked all placeholders and asks nothing worked as yet
+	// pick the first fit and try all nodes if that fails give up
+	if phFit != nil && reqFit != nil {
+		for iterator.HasNext() {
+			node, ok := iterator.Next().(*Node)
+			if !ok {
+				log.Logger().Warn("Node iterator failed to return a node")
+				return nil
+			}
+			if err := node.preAllocateCheck(reqFit.AllocatedResource, reservationKey(nil, sa, reqFit), false); err != nil {
+				continue
+			}
+			// skip the node if conditions can not be satisfied
+			if !node.preAllocateConditions(reqFit.AllocationKey) {
+				continue
+			}
+			// allocation worked: on a non placeholder node update result and return
+			alloc := NewAllocation(common.GetNewUUID(), node.NodeID, reqFit)
+			// double link to make it easier to find
+			alloc.Releases = []*Allocation{phFit}
+			phFit.Releases = []*Allocation{alloc}
+			alloc.Result = Replaced
+			// mark placeholder as released
+			phFit.released = true
+			// update just the node to make sure we keep its spot
+			// no queue update as we're releasing the placeholder and are just temp over the size
+			if !node.AddAllocation(alloc) {
+				log.Logger().Debug("Node update failed unexpectedly",
+					zap.String("applicationID", sa.ApplicationID),
+					zap.String("ask", reqFit.String()),
+					zap.String("placeholder", phFit.String()))
+				return nil

Review comment:
       That would break the gang. If the placeholder is allocated, the queue usage is reserved for the gang. When we release the placeholder, there is no guarantee that another application in the queue is not going to take that space. This could be an older application or even a newer application.
   It would in (almost) all cases directly break gang scheduling.
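   That is why the code above updates only the node and not the queue: the queue usage stays reserved for the gang until the shim confirms the swap. Trimmed from the quoted diff and annotated for reference:
    ```go
    // keep the placeholder's spot: the node is temporarily over its size by the
    // real allocation, but queue usage is left untouched because the placeholder
    // (already counted against the queue) is about to be released and replaced
    if !node.AddAllocation(alloc) {
        log.Logger().Debug("Node update failed unexpectedly",
            zap.String("applicationID", sa.ApplicationID),
            zap.String("ask", reqFit.String()),
            zap.String("placeholder", phFit.String()))
        return nil
    }
    ```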

##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself
+			var ok bool
+			var node *Node
+			for iterator.HasNext() {
+				node, ok = iterator.Next().(*Node)
+				if !ok {
+					log.Logger().Warn("Node iterator failed to return a node")
+					return nil
+				}
+				if node.NodeID == ph.NodeID {
+					break
+				}
+			}
+			// got the node run same checks as for reservation (all but fits)
+			// resource usage should not change anyway between placeholder and real one
+			if node != nil && node.preReserveConditions(request.AllocationKey) {
+				alloc := NewAllocation(common.GetNewUUID(), node.NodeID, request)
+				// double link to make it easier to find
+				alloc.Releases = []*Allocation{ph}
+				ph.Releases = []*Allocation{alloc}
+				alloc.Result = Replaced
+				// mark placeholder as released
+				ph.released = true
+				_, err := sa.updateAskRepeatInternal(request, -1)
+				if err != nil {
+					log.Logger().Warn("ask repeat update failed unexpectedly",
+						zap.Error(err))
+				}
+				return alloc
+			}
+			iterator.Reset()
+		}
+	}
+	// we checked all placeholders and asks nothing worked as yet
+	// pick the first fit and try all nodes if that fails give up
+	if phFit != nil && reqFit != nil {

Review comment:
       We cannot reuse that logic as it does not allow us to link back to the replacement. We also need to be careful with the release if no allocation is made. The queue usage would drop if we release without allocating, and the resource could be used by another application in the queue. That would break the gang, as it might not be possible to get that back.

##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {

Review comment:
       Yes we do: we do not know if the placeholder that we get is for the specific TaskGroup. Since we can have multiple task groups on each application, we need to find one that matches.
   There is no reason to assume that if one placeholder fails they all fail. The expected cause for a failure to allocate the real pod would be that the node predicates fail. This could be a temporary issue or specific to that one node. Assuming the same state for all nodes is wrong. If all placeholders have issues, something strange is going on.
   We can also not run the normal scheduling cycle (see the next review comment).

##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself
+			var ok bool
+			var node *Node
+			for iterator.HasNext() {
+				node, ok = iterator.Next().(*Node)
+				if !ok {
+					log.Logger().Warn("Node iterator failed to return a node")
+					return nil
+				}
+				if node.NodeID == ph.NodeID {
+					break

Review comment:
       This issue resolves itself with the signature change.

##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself

Review comment:
       A new TODO will cause the lint check to fail and thus the build to be marked as failed. That is why I did not add it.
   The solution is rather simple for this issue and would require a change of the function signature to:
   ```
   TryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator, node func() *objects.Node)
   ```
   That will allow us to just retrieve the node and remove a lot of the iteration. I will push a patch for that.
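   With such a getter the per-placeholder node lookup collapses to a direct call, as in the updated diff later in this thread (a sketch; the nil handling shown here is illustrative):
    ```go
    // resolve the node the placeholder sits on directly by its ID,
    // instead of scanning the node iterator for a matching NodeID
    node := getnode(ph.NodeID)
    if node == nil {
        // the node may have been removed in the meantime; skip this placeholder
        continue
    }
    ```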




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] codecov[bot] edited a comment on pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
codecov[bot] edited a comment on pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#issuecomment-755985425


   # [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=h1) Report
   > Merging [#236](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=desc) (a019a8c) into [master](https://codecov.io/gh/apache/incubator-yunikorn-core/commit/5471d84cd619dc9b6c40e970b0e9d53aa3f8e07f?el=desc) (5471d84) will **decrease** coverage by `0.95%`.
   > The diff coverage is `42.71%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/graphs/tree.svg?width=650&height=150&src=pr&token=SB9NrIi3Hy)](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #236      +/-   ##
   ==========================================
   - Coverage   64.02%   63.06%   -0.96%     
   ==========================================
     Files          60       60              
     Lines        4984     5199     +215     
   ==========================================
   + Hits         3191     3279      +88     
   - Misses       1636     1759     +123     
   - Partials      157      161       +4     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [pkg/scheduler/context.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9jb250ZXh0Lmdv) | `6.26% <0.00%> (-0.17%)` | :arrow_down: |
   | [pkg/scheduler/objects/node.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL25vZGUuZ28=) | `81.81% <0.00%> (-1.52%)` | :arrow_down: |
   | [pkg/scheduler/objects/application.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FwcGxpY2F0aW9uLmdv) | `49.06% <28.92%> (-6.25%)` | :arrow_down: |
   | [pkg/scheduler/partition.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9wYXJ0aXRpb24uZ28=) | `62.79% <44.15%> (-3.58%)` | :arrow_down: |
   | [pkg/scheduler/objects/queue.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL3F1ZXVlLmdv) | `69.44% <58.06%> (-0.54%)` | :arrow_down: |
   | [pkg/common/utils.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL2NvbW1vbi91dGlscy5nbw==) | `77.50% <100.00%> (+5.62%)` | :arrow_up: |
   | [pkg/scheduler/objects/allocation.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FsbG9jYXRpb24uZ28=) | `100.00% <100.00%> (+22.22%)` | :arrow_up: |
   | [pkg/scheduler/objects/allocation\_ask.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FsbG9jYXRpb25fYXNrLmdv) | `93.87% <100.00%> (+2.96%)` | :arrow_up: |
   | ... and [1 more](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=footer). Last update [5471d84...18102bf](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] yangwwei commented on a change in pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
yangwwei commented on a change in pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#discussion_r556207918



##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself
+			var ok bool
+			var node *Node
+			for iterator.HasNext() {
+				node, ok = iterator.Next().(*Node)
+				if !ok {
+					log.Logger().Warn("Node iterator failed to return a node")
+					return nil
+				}
+				if node.NodeID == ph.NodeID {
+					break
+				}
+			}
+			// got the node run same checks as for reservation (all but fits)
+			// resource usage should not change anyway between placeholder and real one
+			if node != nil && node.preReserveConditions(request.AllocationKey) {
+				alloc := NewAllocation(common.GetNewUUID(), node.NodeID, request)
+				// double link to make it easier to find
+				alloc.Releases = []*Allocation{ph}
+				ph.Releases = []*Allocation{alloc}
+				alloc.Result = Replaced
+				// mark placeholder as released
+				ph.released = true
+				_, err := sa.updateAskRepeatInternal(request, -1)
+				if err != nil {
+					log.Logger().Warn("ask repeat update failed unexpectedly",
+						zap.Error(err))
+				}
+				return alloc
+			}
+			iterator.Reset()
+		}
+	}
+	// we checked all placeholders and asks nothing worked as yet
+	// pick the first fit and try all nodes if that fails give up
+	if phFit != nil && reqFit != nil {
+		for iterator.HasNext() {
+			node, ok := iterator.Next().(*Node)
+			if !ok {
+				log.Logger().Warn("Node iterator failed to return a node")
+				return nil
+			}
+			if err := node.preAllocateCheck(reqFit.AllocatedResource, reservationKey(nil, sa, reqFit), false); err != nil {
+				continue
+			}
+			// skip the node if conditions can not be satisfied
+			if !node.preAllocateConditions(reqFit.AllocationKey) {
+				continue
+			}
+			// allocation worked: on a non placeholder node update result and return
+			alloc := NewAllocation(common.GetNewUUID(), node.NodeID, reqFit)
+			// double link to make it easier to find
+			alloc.Releases = []*Allocation{phFit}
+			phFit.Releases = []*Allocation{alloc}
+			alloc.Result = Replaced

Review comment:
       Already got the answer from @wilfred-s, closing this one.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on a change in pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on a change in pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#discussion_r556227691



##########
File path: pkg/scheduler/context.go
##########
@@ -111,38 +111,22 @@ func (cc *ClusterContext) schedule() {
 		if psc.isStopped() {
 			continue
 		}
-		// try reservations first
-		alloc := psc.tryReservedAllocate()
-		// nothing reserved that can be allocated try normal allocate
+		// placeholder replacement first
+		alloc := psc.tryPlaceholderAllocate()

Review comment:
       Swapped the logic: first reserved allocations, then placeholder replacement, and normal allocation last.
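   A rough sketch of that order against the partition-level entry points quoted elsewhere in this PR; `tryAllocate` stands for the normal allocation path and the wrapper name is illustrative, not the actual `schedule()` implementation:
    ```go
    // illustrative only: one scheduling pass for a partition in the swapped order
    func scheduleOnce(psc *PartitionContext) *objects.Allocation {
        // 1. reserved allocations first
        if alloc := psc.tryReservedAllocate(); alloc != nil {
            return alloc
        }
        // 2. then placeholder replacement
        if alloc := psc.tryPlaceholderAllocate(); alloc != nil {
            return alloc
        }
        // 3. normal allocation last
        return psc.tryAllocate()
    }
    ```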




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] codecov[bot] edited a comment on pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
codecov[bot] edited a comment on pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#issuecomment-755985425


   # [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=h1) Report
   > Merging [#236](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=desc) (98a2634) into [master](https://codecov.io/gh/apache/incubator-yunikorn-core/commit/5471d84cd619dc9b6c40e970b0e9d53aa3f8e07f?el=desc) (5471d84) will **decrease** coverage by `1.17%`.
   > The diff coverage is `39.49%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/graphs/tree.svg?width=650&height=150&src=pr&token=SB9NrIi3Hy)](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #236      +/-   ##
   ==========================================
   - Coverage   64.02%   62.84%   -1.18%     
   ==========================================
     Files          60       60              
     Lines        4984     5216     +232     
   ==========================================
   + Hits         3191     3278      +87     
   - Misses       1636     1777     +141     
   - Partials      157      161       +4     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [pkg/scheduler/context.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9jb250ZXh0Lmdv) | `6.26% <0.00%> (-0.17%)` | :arrow_down: |
   | [pkg/scheduler/objects/node.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL25vZGUuZ28=) | `81.81% <0.00%> (-1.52%)` | :arrow_down: |
   | [pkg/scheduler/objects/application.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FwcGxpY2F0aW9uLmdv) | `48.33% <27.13%> (-6.97%)` | :arrow_down: |
   | [pkg/scheduler/partition.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9wYXJ0aXRpb24uZ28=) | `61.70% <36.25%> (-4.67%)` | :arrow_down: |
   | [pkg/scheduler/objects/queue.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL3F1ZXVlLmdv) | `69.44% <58.06%> (-0.54%)` | :arrow_down: |
   | [pkg/common/utils.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL2NvbW1vbi91dGlscy5nbw==) | `77.50% <100.00%> (+5.62%)` | :arrow_up: |
   | [pkg/scheduler/objects/allocation.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FsbG9jYXRpb24uZ28=) | `100.00% <100.00%> (+22.22%)` | :arrow_up: |
   | [pkg/scheduler/objects/allocation\_ask.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FsbG9jYXRpb25fYXNrLmdv) | `93.87% <100.00%> (+2.96%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=footer). Last update [5471d84...422f9be](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] yangwwei commented on pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
yangwwei commented on pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#issuecomment-756606256


   hi @TaoYang526, thanks for the draft PR. At first glance I wasn't able to fully understand some of the places, but after reading more from the previous design doc in https://issues.apache.org/jira/browse/YUNIKORN-1 they start to make sense. I quite like the interfaces defined in this PR; the confusing part is `queue_request_manager_plugin.go`. I feel we can somehow optimize this to make it simpler. I will look into more details and get back to you. Thanks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] kingamarton commented on a change in pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
kingamarton commented on a change in pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#discussion_r556616073



##########
File path: pkg/scheduler/objects/allocation_ask.go
##########
@@ -117,3 +134,21 @@ func (aa *AllocationAsk) setPriority(prio int32) {
 	defer aa.Unlock()
 	aa.priority = prio
 }
+
+func (aa *AllocationAsk) isPlaceholder() bool {
+	aa.Lock()
+	defer aa.Unlock()
+	return aa.placeholder
+}
+
+func (aa *AllocationAsk) getTaskGroup() string {
+	aa.Lock()
+	defer aa.Unlock()
+	return aa.taskGroupName
+}
+
+func (aa *AllocationAsk) getTimeout() time.Duration {
+	aa.Lock()
+	defer aa.Unlock()

Review comment:
       A read lock should be enough here as well
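   For illustration, one of these getters with a read lock, assuming `AllocationAsk` embeds a `sync.RWMutex` (a sketch, not the final code; the same pattern applies to `getTimeout` and `isPlaceholder`):
    ```go
    // read-only accessor: a shared (read) lock is sufficient because nothing is mutated
    func (aa *AllocationAsk) getTaskGroup() string {
        aa.RLock()
        defer aa.RUnlock()
        return aa.taskGroupName
    }
    ```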

##########
File path: pkg/scheduler/objects/allocation_ask.go
##########
@@ -117,3 +134,21 @@ func (aa *AllocationAsk) setPriority(prio int32) {
 	defer aa.Unlock()
 	aa.priority = prio
 }
+
+func (aa *AllocationAsk) isPlaceholder() bool {
+	aa.Lock()
+	defer aa.Unlock()

Review comment:
       A read lock should be enough here

##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself
+			var ok bool
+			var node *Node
+			for iterator.HasNext() {
+				node, ok = iterator.Next().(*Node)
+				if !ok {
+					log.Logger().Warn("Node iterator failed to return a node")
+					return nil
+				}
+				if node.NodeID == ph.NodeID {
+					break
+				}
+			}
+			// got the node run same checks as for reservation (all but fits)
+			// resource usage should not change anyway between placeholder and real one
+			if node != nil && node.preReserveConditions(request.AllocationKey) {

Review comment:
       I think we should add a check in the core as well, even if we add it in the shim, to make sure that the core is working properly (throw an error) in case of non-matching resources instead of having some unexpected behaviour.
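   A hedged sketch of what such a core-side guard could look like before the swap; it assumes a `resources.Equals` style comparison and an `AllocatedResource` field on the placeholder allocation, neither of which is part of this PR:
    ```go
    // guard: a real pod asking for different resources than its placeholder
    // points to a shim-side problem, so log it and skip the swap
    if !resources.Equals(ph.AllocatedResource, request.AllocatedResource) {
        log.Logger().Error("placeholder and real allocation resources do not match",
            zap.String("placeholder", ph.String()),
            zap.String("ask", request.String()))
        continue
    }
    ```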

##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,112 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator, getnode func(string) *Node) *Allocation {
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			node := getnode(ph.NodeID)
+			// got the node run same checks as for reservation (all but fits)
+			// resource usage should not change anyway between placeholder and real one
+			if node != nil && node.preReserveConditions(request.AllocationKey) {
+				alloc := NewAllocation(common.GetNewUUID(), node.NodeID, request)
+				// double link to make it easier to find
+				// alloc (the real one) releases points to the placeholder in the releases list
+				alloc.Releases = []*Allocation{ph}
+				// placeholder point to the real one in the releases list
+				ph.Releases = []*Allocation{alloc}
+				alloc.Result = Replaced
+				// mark placeholder as released
+				ph.released = true
+				_, err := sa.updateAskRepeatInternal(request, -1)
+				if err != nil {
+					log.Logger().Warn("ask repeat update failed unexpectedly",
+						zap.Error(err))
+				}
+				return alloc
+			}
+		}
+	}
+	// cannot allocate if the iterator is not giving us any schedulable nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	// we checked all placeholders and asks nothing worked as yet
+	// pick the first fit and try all nodes if that fails give up
+	if phFit != nil && reqFit != nil {
+		for iterator.HasNext() {
+			node, ok := iterator.Next().(*Node)
+			if !ok {
+				log.Logger().Warn("Node iterator failed to return a node")
+				return nil
+			}
+			if err := node.preAllocateCheck(reqFit.AllocatedResource, reservationKey(nil, sa, reqFit), false); err != nil {

Review comment:
       Most of the steps we are performing in lines 701 - 734 are the same as what we are doing in the `tryNodes` part. Can you please export and reuse the common parts if possible?
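   For example, the shared per-node checks could be pulled into a small helper used by both paths (a hypothetical sketch; the name is illustrative and not part of the PR):
    ```go
    // nodeFitsAsk bundles the per-node checks shared by the normal allocation
    // path and the placeholder-replacement path in the quoted diff
    func (sa *Application) nodeFitsAsk(node *Node, ask *AllocationAsk) bool {
        // resource/quota check, mirroring preAllocateCheck above
        if err := node.preAllocateCheck(ask.AllocatedResource, reservationKey(nil, sa, ask), false); err != nil {
            return false
        }
        // plugin/predicate conditions, mirroring preAllocateConditions above
        return node.preAllocateConditions(ask.AllocationKey)
    }
    ```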

##########
File path: pkg/scheduler/objects/allocation_ask.go
##########
@@ -117,3 +134,21 @@ func (aa *AllocationAsk) setPriority(prio int32) {
 	defer aa.Unlock()
 	aa.priority = prio
 }
+
+func (aa *AllocationAsk) isPlaceholder() bool {
+	aa.Lock()
+	defer aa.Unlock()
+	return aa.placeholder
+}
+
+func (aa *AllocationAsk) getTaskGroup() string {
+	aa.Lock()
+	defer aa.Unlock()

Review comment:
       A read lock should be enough here as well




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] kingamarton commented on a change in pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
kingamarton commented on a change in pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#discussion_r555946170



##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself
+			var ok bool
+			var node *Node
+			for iterator.HasNext() {
+				node, ok = iterator.Next().(*Node)
+				if !ok {
+					log.Logger().Warn("Node iterator failed to return a node")
+					return nil
+				}
+				if node.NodeID == ph.NodeID {
+					break
+				}
+			}
+			// got the node run same checks as for reservation (all but fits)
+			// resource usage should not change anyway between placeholder and real one
+			if node != nil && node.preReserveConditions(request.AllocationKey) {

Review comment:
       I know that we discussed offline that the shim side should check that the resources requested for the real pod are the same as for the placeholder, but I think it would be worth adding a check here as well.

##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself

Review comment:
       Is it OK that we keep only the first fit (`phFit == nil && reqFit == nil`)? If yes, it makes no sense to iterate over the other placeholder allocations, because in each iteration the first fit is checked in the next steps as well. If not, then I think we should update phFit and reqFit in each iteration with the new fits.

##########
File path: pkg/scheduler/partition.go
##########
@@ -704,6 +720,57 @@ func (pc *PartitionContext) tryReservedAllocate() *objects.Allocation {
 	return nil
 }
 
+// Try process placeholder for the partition
+// Lock free call this all locks are taken when needed in called functions
+func (pc *PartitionContext) tryPlaceholderAllocate() *objects.Allocation {
+	if !resources.StrictlyGreaterThanZero(pc.root.GetPendingResource()) {
+		// nothing to do just return
+		return nil
+	}
+	// try allocating from the root down
+	alloc := pc.root.TryPlaceholderAllocate(pc.GetAllNodeIterator)
+	if alloc != nil {
+		return pc.replace(alloc)
+	}
+	return nil
+}
+
+// Process the allocation and make the left over changes in the partition.
+func (pc *PartitionContext) replace(alloc *objects.Allocation) *objects.Allocation {
+	pc.Lock()
+	defer pc.Unlock()
+	// partition is locked nothing can change from now on

Review comment:
       I think this comment line is redundant: from lines 740 and 741 we already know that the partition is locked and cannot change.

##########
File path: pkg/scheduler/partition.go
##########
@@ -704,6 +720,57 @@ func (pc *PartitionContext) tryReservedAllocate() *objects.Allocation {
 	return nil
 }
 
+// Try process placeholder for the partition
+// Lock free call this all locks are taken when needed in called functions
+func (pc *PartitionContext) tryPlaceholderAllocate() *objects.Allocation {
+	if !resources.StrictlyGreaterThanZero(pc.root.GetPendingResource()) {
+		// nothing to do just return
+		return nil
+	}
+	// try allocating from the root down
+	alloc := pc.root.TryPlaceholderAllocate(pc.GetAllNodeIterator)
+	if alloc != nil {
+		return pc.replace(alloc)
+	}
+	return nil
+}
+
+// Process the allocation and make the left over changes in the partition.
+func (pc *PartitionContext) replace(alloc *objects.Allocation) *objects.Allocation {
+	pc.Lock()
+	defer pc.Unlock()
+	// partition is locked nothing can change from now on
+	// find the app make sure it still exists
+	appID := alloc.ApplicationID
+	app := pc.applications[appID]
+	if app == nil {
+		log.Logger().Info("Application was removed while allocating",
+			zap.String("appID", appID))
+		return nil
+	}
+	// Safeguard against the unlikely case that we have clashes.
+	// A clash points to entropy issues on the node.
+	if _, found := pc.allocations[alloc.UUID]; found {
+		for {
+			allocationUUID := common.GetNewUUID()
+			log.Logger().Warn("UUID clash, random generator might be lacking entropy",
+				zap.String("uuid", alloc.UUID),
+				zap.String("new UUID", allocationUUID))
+			if pc.allocations[allocationUUID] == nil {
+				alloc.UUID = allocationUUID
+				break
+			}
+		}
+	}

Review comment:
       This is exactly the same as what we have in lines 822-835. Please extract this into a new function and reuse it.
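   For example, the UUID clash handling could live in a shared helper called from both places (a sketch; the name and placement are illustrative, not part of the PR):
    ```go
    // ensureUniqueUUID re-generates the allocation UUID until it no longer clashes
    // with an existing allocation on the partition; callers must hold the partition lock
    func (pc *PartitionContext) ensureUniqueUUID(alloc *objects.Allocation) {
        if _, found := pc.allocations[alloc.UUID]; !found {
            return
        }
        for {
            allocationUUID := common.GetNewUUID()
            log.Logger().Warn("UUID clash, random generator might be lacking entropy",
                zap.String("uuid", alloc.UUID),
                zap.String("new UUID", allocationUUID))
            if pc.allocations[allocationUUID] == nil {
                alloc.UUID = allocationUUID
                break
            }
        }
    }
    ```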




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] codecov[bot] edited a comment on pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
codecov[bot] edited a comment on pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#issuecomment-755985425


   # [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=h1) Report
   > Merging [#236](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=desc) (422f9be) into [master](https://codecov.io/gh/apache/incubator-yunikorn-core/commit/5471d84cd619dc9b6c40e970b0e9d53aa3f8e07f?el=desc) (5471d84) will **decrease** coverage by `0.95%`.
   > The diff coverage is `42.43%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/graphs/tree.svg?width=650&height=150&src=pr&token=SB9NrIi3Hy)](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #236      +/-   ##
   ==========================================
   - Coverage   64.02%   63.06%   -0.96%     
   ==========================================
     Files          60       60              
     Lines        4984     5199     +215     
   ==========================================
   + Hits         3191     3279      +88     
   - Misses       1636     1759     +123     
   - Partials      157      161       +4     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [pkg/scheduler/context.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9jb250ZXh0Lmdv) | `6.26% <0.00%> (-0.17%)` | :arrow_down: |
   | [pkg/scheduler/objects/node.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL25vZGUuZ28=) | `81.81% <0.00%> (-1.52%)` | :arrow_down: |
   | [pkg/scheduler/objects/application.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FwcGxpY2F0aW9uLmdv) | `49.06% <28.92%> (-6.25%)` | :arrow_down: |
   | [pkg/scheduler/partition.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9wYXJ0aXRpb24uZ28=) | `62.79% <44.15%> (-3.58%)` | :arrow_down: |
   | [pkg/scheduler/objects/queue.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL3F1ZXVlLmdv) | `69.44% <58.06%> (-0.54%)` | :arrow_down: |
   | [pkg/common/utils.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL2NvbW1vbi91dGlscy5nbw==) | `77.50% <100.00%> (+5.62%)` | :arrow_up: |
   | [pkg/scheduler/objects/allocation.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FsbG9jYXRpb24uZ28=) | `100.00% <100.00%> (+22.22%)` | :arrow_up: |
   | [pkg/scheduler/objects/allocation\_ask.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FsbG9jYXRpb25fYXNrLmdv) | `93.87% <100.00%> (+2.96%)` | :arrow_up: |
   | ... and [1 more](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=footer). Last update [5471d84...18102bf](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on a change in pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on a change in pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#discussion_r556222855



##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself

Review comment:
       We iterate over all placeholders because not all of them might belong to the same TaskGroup, and those that do not cannot be used. We must find the first placeholder that fits the task group.
   
   We then retain that first fit on purpose. If we cannot replace any of the placeholders with any of the outstanding requests, we try to allocate that one. It is not an arbitrary choice: it is the first request, based on the request sorting, that still has a placeholder that can be replaced, and it is the request that we should allocate before anything else.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] yangwwei commented on a change in pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
yangwwei commented on a change in pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#discussion_r554835180



##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself

Review comment:
       Agreed, this is not efficient.
   We need to mark a TODO here; it might be the first thing to fix after this.

##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {

Review comment:
       Do we really need to iterate over all placeholders for each request?
   Ideally, the placeholders belonging to one task group are considered homogeneous: same resources, same node-selector and tolerations. If one placeholder fails, in most cases all the other placeholders will fail as well.
   Can we try a single placeholder replace here and, if it fails, simply mark a release and run the normal scheduling for the pod? I think that would simplify the code a lot (a sketch of this idea is shown below).
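   
   A rough sketch of that idea (illustrative only, not code from the PR; `firstUnreleasedPlaceholder` and `trySwapOnNode` are hypothetical helpers used purely to show the control flow):
   
   ```go
   // Proposed simplification: try exactly one placeholder per task group; if
   // the swap cannot be made, release the placeholder and let the request go
   // through the normal tryAllocate() cycle later.
   func (sa *Application) tryPlaceholderAllocateSimplified() *Allocation {
       for _, request := range sa.sortedRequests {
           if request.placeholder || request.taskGroupName == "" {
               continue
           }
           // hypothetical: first not-yet-released placeholder of this task group
           ph := sa.firstUnreleasedPlaceholder(request.taskGroupName)
           if ph == nil {
               continue
           }
           // hypothetical: run the swap checks against the placeholder's node
           if alloc := sa.trySwapOnNode(ph, request); alloc != nil {
               return alloc
           }
           // swap not possible: mark the placeholder released; the request is
           // picked up again by normal scheduling once the release is confirmed
           ph.released = true
       }
       return nil
   }
   ```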

##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself
+			var ok bool
+			var node *Node
+			for iterator.HasNext() {
+				node, ok = iterator.Next().(*Node)
+				if !ok {
+					log.Logger().Warn("Node iterator failed to return a node")
+					return nil
+				}
+				if node.NodeID == ph.NodeID {
+					break

Review comment:
       Can we move the code from lines 685 to 699 here?

##########
File path: pkg/scheduler/partition.go
##########
@@ -320,6 +320,22 @@ func (pc *PartitionContext) AddApplication(app *objects.Application) error {
 		return fmt.Errorf("failed to find queue %s for application %s", queueName, appID)
 	}
 
+	// check only for gang request
+	// - make sure the taskgroup request fits in the maximum set for the queue hierarchy
+	// - task groups should only be used in FIFO or StateAware queues
+	if placeHolder := app.GetPlaceholderAsk(); !resources.IsZero(placeHolder) {
+		// retrieve the max set
+		if maxQueue := queue.GetMaxQueueSet(); maxQueue != nil {
+			if !resources.FitIn(maxQueue, placeHolder) {
+				return fmt.Errorf("queue %s cannot fit application %s: task group request %s larger than max queue allocation", queueName, appID, placeHolder.String())
+			}
+		}
+		// check the queue sorting
+		if !queue.SupportTaskGroup() {
+			return fmt.Errorf("queue %s cannot run application %s with task group request: unsupported sort type", queueName, appID)
+		}

Review comment:
       Thanks for adding this check!
   

##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself
+			var ok bool
+			var node *Node
+			for iterator.HasNext() {
+				node, ok = iterator.Next().(*Node)
+				if !ok {
+					log.Logger().Warn("Node iterator failed to return a node")
+					return nil
+				}
+				if node.NodeID == ph.NodeID {
+					break
+				}
+			}
+			// got the node run same checks as for reservation (all but fits)
+			// resource usage should not change anyway between placeholder and real one
+			if node != nil && node.preReserveConditions(request.AllocationKey) {
+				alloc := NewAllocation(common.GetNewUUID(), node.NodeID, request)
+				// double link to make it easier to find
+				alloc.Releases = []*Allocation{ph}
+				ph.Releases = []*Allocation{alloc}
+				alloc.Result = Replaced
+				// mark placeholder as released
+				ph.released = true
+				_, err := sa.updateAskRepeatInternal(request, -1)
+				if err != nil {
+					log.Logger().Warn("ask repeat update failed unexpectedly",
+						zap.Error(err))
+				}
+				return alloc
+			}
+			iterator.Reset()
+		}
+	}
+	// we checked all placeholders and asks nothing worked as yet
+	// pick the first fit and try all nodes if that fails give up
+	if phFit != nil && reqFit != nil {

Review comment:
       IIUC, this is a fallback mechanism in case we can't find any valid node from the placeholders.
   In this case, it is more like a normal allocation. Can we merge this into the normal processing logic, aka `TryAllocate()`?

##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself
+			var ok bool
+			var node *Node
+			for iterator.HasNext() {
+				node, ok = iterator.Next().(*Node)
+				if !ok {
+					log.Logger().Warn("Node iterator failed to return a node")
+					return nil
+				}
+				if node.NodeID == ph.NodeID {
+					break
+				}
+			}
+			// got the node run same checks as for reservation (all but fits)
+			// resource usage should not change anyway between placeholder and real one
+			if node != nil && node.preReserveConditions(request.AllocationKey) {
+				alloc := NewAllocation(common.GetNewUUID(), node.NodeID, request)
+				// double link to make it easier to find
+				alloc.Releases = []*Allocation{ph}
+				ph.Releases = []*Allocation{alloc}
+				alloc.Result = Replaced
+				// mark placeholder as released
+				ph.released = true
+				_, err := sa.updateAskRepeatInternal(request, -1)
+				if err != nil {
+					log.Logger().Warn("ask repeat update failed unexpectedly",
+						zap.Error(err))
+				}
+				return alloc
+			}
+			iterator.Reset()
+		}
+	}
+	// we checked all placeholders and asks nothing worked as yet
+	// pick the first fit and try all nodes if that fails give up
+	if phFit != nil && reqFit != nil {
+		for iterator.HasNext() {
+			node, ok := iterator.Next().(*Node)
+			if !ok {
+				log.Logger().Warn("Node iterator failed to return a node")
+				return nil
+			}
+			if err := node.preAllocateCheck(reqFit.AllocatedResource, reservationKey(nil, sa, reqFit), false); err != nil {
+				continue
+			}
+			// skip the node if conditions can not be satisfied
+			if !node.preAllocateConditions(reqFit.AllocationKey) {
+				continue
+			}
+			// allocation worked: on a non placeholder node update result and return
+			alloc := NewAllocation(common.GetNewUUID(), node.NodeID, reqFit)
+			// double link to make it easier to find
+			alloc.Releases = []*Allocation{phFit}
+			phFit.Releases = []*Allocation{alloc}
+			alloc.Result = Replaced
+			// mark placeholder as released
+			phFit.released = true
+			// update just the node to make sure we keep its spot
+			// no queue update as we're releasing the placeholder and are just temp over the size
+			if !node.AddAllocation(alloc) {
+				log.Logger().Debug("Node update failed unexpectedly",
+					zap.String("applicationID", sa.ApplicationID),
+					zap.String("ask", reqFit.String()),
+					zap.String("placeholder", phFit.String()))
+				return nil

Review comment:
       Do we not need to wait for a placeholder to be released before updating the node resources?
   This seems to diverge from the normal processing logic.
   If we can merge this part into the normal processing code, can we do something like the following:
   
   1. Placeholder allocation fails (doesn't fit)
   2. Mark the placeholder for release, "park" the request
   3. Wait until the placeholder is released in the shim, then do a normal allocation for the request
   
   This means: the placeholder wasn't good, so we try to schedule the pod as before (a very rough sketch of this idea follows below).
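   
   A very rough sketch of the parked-request idea (illustrative only; `parkedRequests` and `onPlaceholderReleased` are hypothetical and not part of this PR):
   
   ```go
   // Park the request when the placeholder swap does not fit: mark the
   // placeholder for release towards the shim and remember the request.
   func (sa *Application) parkRequest(ph *Allocation, request *AllocationAsk) {
       ph.released = true                   // triggers the release towards the shim
       sa.parkedRequests[ph.UUID] = request // hypothetical bookkeeping field
   }
   
   // Called once the shim confirms the placeholder release: the parked request
   // simply re-enters the normal tryAllocate() cycle, no special handling.
   func (sa *Application) onPlaceholderReleased(ph *Allocation) {
       request, ok := sa.parkedRequests[ph.UUID]
       if !ok {
           return
       }
       delete(sa.parkedRequests, ph.UUID)
       log.Logger().Info("placeholder released, request re-enters normal scheduling",
           zap.String("allocationKey", request.AllocationKey))
   }
   ```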




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on a change in pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on a change in pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#discussion_r556225514



##########
File path: pkg/scheduler/partition.go
##########
@@ -704,6 +720,57 @@ func (pc *PartitionContext) tryReservedAllocate() *objects.Allocation {
 	return nil
 }
 
+// Try process placeholder for the partition
+// Lock free call this all locks are taken when needed in called functions
+func (pc *PartitionContext) tryPlaceholderAllocate() *objects.Allocation {
+	if !resources.StrictlyGreaterThanZero(pc.root.GetPendingResource()) {
+		// nothing to do just return
+		return nil
+	}
+	// try allocating from the root down
+	alloc := pc.root.TryPlaceholderAllocate(pc.GetAllNodeIterator)
+	if alloc != nil {
+		return pc.replace(alloc)
+	}
+	return nil
+}
+
+// Process the allocation and make the left over changes in the partition.
+func (pc *PartitionContext) replace(alloc *objects.Allocation) *objects.Allocation {
+	pc.Lock()
+	defer pc.Unlock()
+	// partition is locked nothing can change from now on
+	// find the app make sure it still exists
+	appID := alloc.ApplicationID
+	app := pc.applications[appID]
+	if app == nil {
+		log.Logger().Info("Application was removed while allocating",
+			zap.String("appID", appID))
+		return nil
+	}
+	// Safeguard against the unlikely case that we have clashes.
+	// A clash points to entropy issues on the node.
+	if _, found := pc.allocations[alloc.UUID]; found {
+		for {
+			allocationUUID := common.GetNewUUID()
+			log.Logger().Warn("UUID clash, random generator might be lacking entropy",
+				zap.String("uuid", alloc.UUID),
+				zap.String("new UUID", allocationUUID))
+			if pc.allocations[allocationUUID] == nil {
+				alloc.UUID = allocationUUID
+				break
+			}
+		}
+	}

Review comment:
       refactored into a new `checkUUID()` method




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#issuecomment-755991876


   I want to add further tests after the weekend. The code coverage should not drop, especially as this is new, often executed allocation code.
   With the latest events change we get the full gang working, based on the offline tests with @yangwwei


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] yangwwei commented on a change in pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
yangwwei commented on a change in pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#discussion_r555477620



##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself
+			var ok bool
+			var node *Node
+			for iterator.HasNext() {
+				node, ok = iterator.Next().(*Node)
+				if !ok {
+					log.Logger().Warn("Node iterator failed to return a node")
+					return nil
+				}
+				if node.NodeID == ph.NodeID {
+					break
+				}
+			}
+			// got the node run same checks as for reservation (all but fits)
+			// resource usage should not change anyway between placeholder and real one
+			if node != nil && node.preReserveConditions(request.AllocationKey) {
+				alloc := NewAllocation(common.GetNewUUID(), node.NodeID, request)
+				// double link to make it easier to find
+				alloc.Releases = []*Allocation{ph}
+				ph.Releases = []*Allocation{alloc}
+				alloc.Result = Replaced
+				// mark placeholder as released
+				ph.released = true
+				_, err := sa.updateAskRepeatInternal(request, -1)
+				if err != nil {
+					log.Logger().Warn("ask repeat update failed unexpectedly",
+						zap.Error(err))
+				}
+				return alloc
+			}
+			iterator.Reset()
+		}
+	}
+	// we checked all placeholders and asks nothing worked as yet
+	// pick the first fit and try all nodes if that fails give up
+	if phFit != nil && reqFit != nil {
+		for iterator.HasNext() {
+			node, ok := iterator.Next().(*Node)
+			if !ok {
+				log.Logger().Warn("Node iterator failed to return a node")
+				return nil
+			}
+			if err := node.preAllocateCheck(reqFit.AllocatedResource, reservationKey(nil, sa, reqFit), false); err != nil {
+				continue
+			}
+			// skip the node if conditions can not be satisfied
+			if !node.preAllocateConditions(reqFit.AllocationKey) {
+				continue
+			}
+			// allocation worked: on a non placeholder node update result and return
+			alloc := NewAllocation(common.GetNewUUID(), node.NodeID, reqFit)
+			// double link to make it easier to find
+			alloc.Releases = []*Allocation{phFit}
+			phFit.Releases = []*Allocation{alloc}
+			alloc.Result = Replaced
+			// mark placeholder as released
+			phFit.released = true
+			// update just the node to make sure we keep its spot
+			// no queue update as we're releasing the placeholder and are just temp over the size
+			if !node.AddAllocation(alloc) {
+				log.Logger().Debug("Node update failed unexpectedly",
+					zap.String("applicationID", sa.ApplicationID),
+					zap.String("ask", reqFit.String()),
+					zap.String("placeholder", phFit.String()))
+				return nil

Review comment:
       > Do we not need to wait for a placeholder to be released before updating the node resources?
   > This seems to diverge from the normal processing logic.
   > If we can merge this part into the normal processing code, can we do something like the following:
   > 
   > 1. Placeholder allocation fails (doesn't fit)
   > 2. Mark the placeholder for release, "park" the request
   > 3. Wait until the placeholder is released in the shim, then do a normal allocation for the request
   > 
   > This means: the placeholder wasn't good, so we try to schedule the pod as before.
   
   Just to add to this: one concern I have is that we may go over the queue resource quota with this.
   Imagine a case where we have 10 placeholders and 10 real pods for an app. At the beginning we have 10 placeholders allocated, so we have:
   
   10 placeholders, 1GB * 10 = 10GB
     - app: 10GB in placeholders
     - nodes: 10GB in allocated resources
     - queue: 10GB allocated resources
   
   Now imagine the worst-case scenario: the placeholders don't fit, and we start to fall back to allocating the real pods on other nodes. Then we get:
   
     - app: 10GB in placeholders, 10GB in allocated resources
     - nodes: 20GB
     - queue: 10GB (* dangerous)
   
   *dangerous: both the app and the nodes are tracking 20GB of used resources for this app, but the queue only tracks 10GB. If the queue has a resource quota of 20GB, it is possible that before the shim releases all 10 placeholders the core will schedule other pods from other apps, which goes over the quota defined for this queue.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] codecov[bot] edited a comment on pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
codecov[bot] edited a comment on pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#issuecomment-755985425


   # [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=h1) Report
   > Merging [#236](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=desc) (18102bf) into [master](https://codecov.io/gh/apache/incubator-yunikorn-core/commit/5471d84cd619dc9b6c40e970b0e9d53aa3f8e07f?el=desc) (5471d84) will **decrease** coverage by `0.95%`.
   > The diff coverage is `43.56%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/graphs/tree.svg?width=650&height=150&src=pr&token=SB9NrIi3Hy)](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #236      +/-   ##
   ==========================================
   - Coverage   64.02%   63.06%   -0.96%     
   ==========================================
     Files          60       60              
     Lines        4984     5199     +215     
   ==========================================
   + Hits         3191     3279      +88     
   - Misses       1636     1759     +123     
   - Partials      157      161       +4     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [pkg/scheduler/context.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9jb250ZXh0Lmdv) | `6.26% <0.00%> (-0.17%)` | :arrow_down: |
   | [pkg/scheduler/objects/node.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL25vZGUuZ28=) | `81.81% <0.00%> (-1.52%)` | :arrow_down: |
   | [pkg/scheduler/objects/application.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FwcGxpY2F0aW9uLmdv) | `49.06% <28.92%> (-6.25%)` | :arrow_down: |
   | [pkg/scheduler/partition.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9wYXJ0aXRpb24uZ28=) | `62.79% <47.43%> (-3.58%)` | :arrow_down: |
   | [pkg/scheduler/objects/queue.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL3F1ZXVlLmdv) | `69.44% <58.06%> (-0.54%)` | :arrow_down: |
   | [pkg/common/utils.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL2NvbW1vbi91dGlscy5nbw==) | `77.50% <100.00%> (+5.62%)` | :arrow_up: |
   | [pkg/scheduler/objects/allocation.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FsbG9jYXRpb24uZ28=) | `100.00% <100.00%> (+22.22%)` | :arrow_up: |
   | [pkg/scheduler/objects/allocation\_ask.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FsbG9jYXRpb25fYXNrLmdv) | `93.87% <100.00%> (+2.96%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=footer). Last update [5471d84...18102bf](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on a change in pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on a change in pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#discussion_r556226899



##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself
+			var ok bool
+			var node *Node
+			for iterator.HasNext() {
+				node, ok = iterator.Next().(*Node)
+				if !ok {
+					log.Logger().Warn("Node iterator failed to return a node")
+					return nil
+				}
+				if node.NodeID == ph.NodeID {
+					break
+				}
+			}
+			// got the node run same checks as for reservation (all but fits)
+			// resource usage should not change anyway between placeholder and real one
+			if node != nil && node.preReserveConditions(request.AllocationKey) {
+				alloc := NewAllocation(common.GetNewUUID(), node.NodeID, request)
+				// double link to make it easier to find
+				alloc.Releases = []*Allocation{ph}
+				ph.Releases = []*Allocation{alloc}
+				alloc.Result = Replaced
+				// mark placeholder as released
+				ph.released = true
+				_, err := sa.updateAskRepeatInternal(request, -1)
+				if err != nil {
+					log.Logger().Warn("ask repeat update failed unexpectedly",
+						zap.Error(err))
+				}
+				return alloc
+			}
+			iterator.Reset()
+		}
+	}
+	// we checked all placeholders and asks nothing worked as yet
+	// pick the first fit and try all nodes if that fails give up
+	if phFit != nil && reqFit != nil {
+		for iterator.HasNext() {
+			node, ok := iterator.Next().(*Node)
+			if !ok {
+				log.Logger().Warn("Node iterator failed to return a node")
+				return nil
+			}
+			if err := node.preAllocateCheck(reqFit.AllocatedResource, reservationKey(nil, sa, reqFit), false); err != nil {
+				continue
+			}
+			// skip the node if conditions can not be satisfied
+			if !node.preAllocateConditions(reqFit.AllocationKey) {
+				continue
+			}
+			// allocation worked: on a non placeholder node update result and return
+			alloc := NewAllocation(common.GetNewUUID(), node.NodeID, reqFit)
+			// double link to make it easier to find
+			alloc.Releases = []*Allocation{phFit}
+			phFit.Releases = []*Allocation{alloc}
+			alloc.Result = Replaced

Review comment:
       comments updated in two locations to clear this up: it is doing the right thing, releasing on node A and adding on node B




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] yangwwei merged pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
yangwwei merged pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] yangwwei commented on pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
yangwwei commented on pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#issuecomment-759629088


   I had a long discussion with @wilfred-s about the main concern I have with this PR.
   As we discussed [in this comment](https://github.com/apache/incubator-yunikorn-core/pull/236#discussion_r554867952), it is not something that can easily be fixed without making the code more complex. I have already done some basic tests and the current code works. Our conclusion is to get this merged, run more tests (including the negative scenarios), and revisit it if we find any issue. We don't need to block the review stage on this.
   So I am +1 on the change; let's get this merged.
   The tests are passing, and the code coverage can be improved in a follow-up JIRA. Also @kingamarton please file a JIRA to address the latest comments you added; they make sense.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on a change in pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on a change in pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#discussion_r556227774



##########
File path: pkg/scheduler/partition.go
##########
@@ -704,6 +720,57 @@ func (pc *PartitionContext) tryReservedAllocate() *objects.Allocation {
 	return nil
 }
 
+// Try process placeholder for the partition
+// Lock free call this all locks are taken when needed in called functions
+func (pc *PartitionContext) tryPlaceholderAllocate() *objects.Allocation {
+	if !resources.StrictlyGreaterThanZero(pc.root.GetPendingResource()) {
+		// nothing to do just return
+		return nil
+	}
+	// try allocating from the root down
+	alloc := pc.root.TryPlaceholderAllocate(pc.GetAllNodeIterator)
+	if alloc != nil {
+		return pc.replace(alloc)
+	}
+	return nil
+}
+
+// Process the allocation and make the left over changes in the partition.
+func (pc *PartitionContext) replace(alloc *objects.Allocation) *objects.Allocation {
+	pc.Lock()
+	defer pc.Unlock()
+	// partition is locked nothing can change from now on

Review comment:
       removed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] codecov[bot] edited a comment on pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
codecov[bot] edited a comment on pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#issuecomment-755985425


   # [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=h1) Report
   > Merging [#236](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=desc) (98a2634) into [master](https://codecov.io/gh/apache/incubator-yunikorn-core/commit/5471d84cd619dc9b6c40e970b0e9d53aa3f8e07f?el=desc) (5471d84) will **decrease** coverage by `1.17%`.
   > The diff coverage is `39.49%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/graphs/tree.svg?width=650&height=150&src=pr&token=SB9NrIi3Hy)](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #236      +/-   ##
   ==========================================
   - Coverage   64.02%   62.84%   -1.18%     
   ==========================================
     Files          60       60              
     Lines        4984     5216     +232     
   ==========================================
   + Hits         3191     3278      +87     
   - Misses       1636     1777     +141     
   - Partials      157      161       +4     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [pkg/scheduler/context.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9jb250ZXh0Lmdv) | `6.26% <0.00%> (-0.17%)` | :arrow_down: |
   | [pkg/scheduler/objects/node.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL25vZGUuZ28=) | `81.81% <0.00%> (-1.52%)` | :arrow_down: |
   | [pkg/scheduler/objects/application.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FwcGxpY2F0aW9uLmdv) | `48.33% <27.13%> (-6.97%)` | :arrow_down: |
   | [pkg/scheduler/partition.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9wYXJ0aXRpb24uZ28=) | `61.70% <36.25%> (-4.67%)` | :arrow_down: |
   | [pkg/scheduler/objects/queue.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL3F1ZXVlLmdv) | `69.44% <58.06%> (-0.54%)` | :arrow_down: |
   | [pkg/common/utils.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL2NvbW1vbi91dGlscy5nbw==) | `77.50% <100.00%> (+5.62%)` | :arrow_up: |
   | [pkg/scheduler/objects/allocation.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FsbG9jYXRpb24uZ28=) | `100.00% <100.00%> (+22.22%)` | :arrow_up: |
   | [pkg/scheduler/objects/allocation\_ask.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FsbG9jYXRpb25fYXNrLmdv) | `93.87% <100.00%> (+2.96%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=footer). Last update [5471d84...a019a8c](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] wilfred-s commented on a change in pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on a change in pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#discussion_r556220781



##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself
+			var ok bool
+			var node *Node
+			for iterator.HasNext() {
+				node, ok = iterator.Next().(*Node)
+				if !ok {
+					log.Logger().Warn("Node iterator failed to return a node")
+					return nil
+				}
+				if node.NodeID == ph.NodeID {
+					break
+				}
+			}
+			// got the node run same checks as for reservation (all but fits)
+			// resource usage should not change anyway between placeholder and real one
+			if node != nil && node.preReserveConditions(request.AllocationKey) {
+				alloc := NewAllocation(common.GetNewUUID(), node.NodeID, request)
+				// double link to make it easier to find
+				alloc.Releases = []*Allocation{ph}
+				ph.Releases = []*Allocation{alloc}
+				alloc.Result = Replaced
+				// mark placeholder as released
+				ph.released = true
+				_, err := sa.updateAskRepeatInternal(request, -1)
+				if err != nil {
+					log.Logger().Warn("ask repeat update failed unexpectedly",
+						zap.Error(err))
+				}
+				return alloc
+			}
+			iterator.Reset()
+		}
+	}
+	// we checked all placeholders and asks nothing worked as yet
+	// pick the first fit and try all nodes if that fails give up
+	if phFit != nil && reqFit != nil {
+		for iterator.HasNext() {
+			node, ok := iterator.Next().(*Node)
+			if !ok {
+				log.Logger().Warn("Node iterator failed to return a node")
+				return nil
+			}
+			if err := node.preAllocateCheck(reqFit.AllocatedResource, reservationKey(nil, sa, reqFit), false); err != nil {
+				continue
+			}
+			// skip the node if conditions can not be satisfied
+			if !node.preAllocateConditions(reqFit.AllocationKey) {
+				continue
+			}
+			// allocation worked: on a non placeholder node update result and return
+			alloc := NewAllocation(common.GetNewUUID(), node.NodeID, reqFit)
+			// double link to make it easier to find
+			alloc.Releases = []*Allocation{phFit}
+			phFit.Releases = []*Allocation{alloc}
+			alloc.Result = Replaced
+			// mark placeholder as released
+			phFit.released = true
+			// update just the node to make sure we keep its spot
+			// no queue update as we're releasing the placeholder and are just temp over the size
+			if !node.AddAllocation(alloc) {
+				log.Logger().Debug("Node update failed unexpectedly",
+					zap.String("applicationID", sa.ApplicationID),
+					zap.String("ask", reqFit.String()),
+					zap.String("placeholder", phFit.String()))
+				return nil

Review comment:
       Unless you can somehow guarantee that a node cannot change over time, there is no way to get that 100% guarantee for gang scheduling. It is just not possible; we are close to it but not there. What this approach tries to do is make sure that the exception case is as small as possible and handled as well as possible, preventing breakage where we can.
   
   An early release of a placeholder from the queue's total use would break it at the point where we have told the application that the gang is reserved. If there is a limit on the queue, it could even mean that we cannot get that quota back because another application in the queue has started using it. It might thus cause such a problem that the gang can never be fully allocated.
   The only other option is to reserve the node so we do not lose it. Then, on the release confirmation for the placeholder, finish the real allocation. The release would be processed on the node and in the app. We would need to push the real allocation through a special allocation cycle to update the "new" node, and then update the app with the final result.
   We would also somehow have to make sure that the reservation is removed from the normal `TryReserveAllocate` cycle, which adds another complexity. The node that got reserved would be unavailable for scheduling the whole time the release is being processed. That also does not feel correct and is really complex.
   
   For the temporary discrepancy: it *only* happens if all the placeholders run on nodes that for some reason cannot support the real allocation (it is the failure case). It lasts only for the time it takes K8s to release the placeholder, and then everything is back in sync.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] codecov[bot] commented on pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
codecov[bot] commented on pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#issuecomment-755985425


   # [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=h1) Report
   > Merging [#236](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=desc) (98a2634) into [master](https://codecov.io/gh/apache/incubator-yunikorn-core/commit/5471d84cd619dc9b6c40e970b0e9d53aa3f8e07f?el=desc) (5471d84) will **decrease** coverage by `1.17%`.
   > The diff coverage is `39.49%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/graphs/tree.svg?width=650&height=150&src=pr&token=SB9NrIi3Hy)](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #236      +/-   ##
   ==========================================
   - Coverage   64.02%   62.84%   -1.18%     
   ==========================================
     Files          60       60              
     Lines        4984     5216     +232     
   ==========================================
   + Hits         3191     3278      +87     
   - Misses       1636     1777     +141     
   - Partials      157      161       +4     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [pkg/scheduler/context.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9jb250ZXh0Lmdv) | `6.26% <0.00%> (-0.17%)` | :arrow_down: |
   | [pkg/scheduler/objects/node.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL25vZGUuZ28=) | `81.81% <0.00%> (-1.52%)` | :arrow_down: |
   | [pkg/scheduler/objects/application.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FwcGxpY2F0aW9uLmdv) | `48.33% <27.13%> (-6.97%)` | :arrow_down: |
   | [pkg/scheduler/partition.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9wYXJ0aXRpb24uZ28=) | `61.70% <36.25%> (-4.67%)` | :arrow_down: |
   | [pkg/scheduler/objects/queue.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL3F1ZXVlLmdv) | `69.44% <58.06%> (-0.54%)` | :arrow_down: |
   | [pkg/common/utils.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL2NvbW1vbi91dGlscy5nbw==) | `77.50% <100.00%> (+5.62%)` | :arrow_up: |
   | [pkg/scheduler/objects/allocation.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FsbG9jYXRpb24uZ28=) | `100.00% <100.00%> (+22.22%)` | :arrow_up: |
   | [pkg/scheduler/objects/allocation\_ask.go](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236/diff?src=pr&el=tree#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FsbG9jYXRpb25fYXNrLmdv) | `93.87% <100.00%> (+2.96%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=footer). Last update [5471d84...98a2634](https://codecov.io/gh/apache/incubator-yunikorn-core/pull/236?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] yangwwei commented on a change in pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
yangwwei commented on a change in pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#discussion_r556160298



##########
File path: pkg/scheduler/objects/application.go
##########
@@ -604,6 +630,121 @@ func (sa *Application) tryAllocate(headRoom *resources.Resource, nodeIterator fu
 	return nil
 }
 
+// Try to replace a placeholder with a real allocation
+func (sa *Application) tryPlaceholderAllocate(nodeIterator func() interfaces.NodeIterator) *Allocation {
+	// cannot allocate if the iterator is not giving us any nodes
+	iterator := nodeIterator()
+	if iterator == nil {
+		return nil
+	}
+	sa.Lock()
+	defer sa.Unlock()
+	// nothing to do if we have no placeholders allocated
+	if resources.IsZero(sa.allocatedPlaceholder) {
+		return nil
+	}
+	// make sure the request are sorted
+	sa.sortRequests(false)
+	// keep the first fits for later
+	var phFit *Allocation
+	var reqFit *AllocationAsk
+	// get all the requests from the app sorted in order
+	for _, request := range sa.sortedRequests {
+		// skip placeholders they follow standard allocation
+		// this should also be part of a task group just make sure it is
+		if request.placeholder || request.taskGroupName == "" {
+			continue
+		}
+		// walk over the placeholders, allow for processing all as we can have multiple task groups
+		phAllocs := sa.getPlaceholderAllocations()
+		for _, ph := range phAllocs {
+			// we could have already released this placeholder and are waiting for the shim to confirm
+			// and check that we have the correct task group before trying to swap
+			if ph.released || request.taskGroupName != ph.taskGroupName {
+				continue
+			}
+			if phFit == nil && reqFit == nil {
+				phFit = ph
+				reqFit = request
+			}
+			// this is inefficient but needs to do for now as we only have the ID not the handle on the node itself
+			var ok bool
+			var node *Node
+			for iterator.HasNext() {
+				node, ok = iterator.Next().(*Node)
+				if !ok {
+					log.Logger().Warn("Node iterator failed to return a node")
+					return nil
+				}
+				if node.NodeID == ph.NodeID {
+					break
+				}
+			}
+			// got the node run same checks as for reservation (all but fits)
+			// resource usage should not change anyway between placeholder and real one
+			if node != nil && node.preReserveConditions(request.AllocationKey) {
+				alloc := NewAllocation(common.GetNewUUID(), node.NodeID, request)
+				// double link to make it easier to find
+				alloc.Releases = []*Allocation{ph}
+				ph.Releases = []*Allocation{alloc}
+				alloc.Result = Replaced
+				// mark placeholder as released
+				ph.released = true
+				_, err := sa.updateAskRepeatInternal(request, -1)
+				if err != nil {
+					log.Logger().Warn("ask repeat update failed unexpectedly",
+						zap.Error(err))
+				}
+				return alloc
+			}
+			iterator.Reset()
+		}
+	}
+	// we checked all placeholders and asks nothing worked as yet
+	// pick the first fit and try all nodes if that fails give up
+	if phFit != nil && reqFit != nil {
+		for iterator.HasNext() {
+			node, ok := iterator.Next().(*Node)
+			if !ok {
+				log.Logger().Warn("Node iterator failed to return a node")
+				return nil
+			}
+			if err := node.preAllocateCheck(reqFit.AllocatedResource, reservationKey(nil, sa, reqFit), false); err != nil {
+				continue
+			}
+			// skip the node if conditions can not be satisfied
+			if !node.preAllocateConditions(reqFit.AllocationKey) {
+				continue
+			}
+			// allocation worked: on a non placeholder node update result and return
+			alloc := NewAllocation(common.GetNewUUID(), node.NodeID, reqFit)
+			// double link to make it easier to find
+			alloc.Releases = []*Allocation{phFit}
+			phFit.Releases = []*Allocation{alloc}
+			alloc.Result = Replaced
+			// mark placeholder as released
+			phFit.released = true
+			// update just the node to make sure we keep its spot
+			// no queue update as we're releasing the placeholder and are just temp over the size
+			if !node.AddAllocation(alloc) {
+				log.Logger().Debug("Node update failed unexpectedly",
+					zap.String("applicationID", sa.ApplicationID),
+					zap.String("ask", reqFit.String()),
+					zap.String("placeholder", phFit.String()))
+				return nil

Review comment:
       hi @wilfred-s, even with the approach in this PR, gang scheduling is not 100% guaranteed. If we cannot use the placeholders for the real pods and instead schedule them on other nodes, we do not know how many of them will be allocated successfully and how many will stay pending. That still breaks the gang scheduling semantics.
   
   The case I described above is pretty bad. The scheduler data should always be consistent; one of the invariants is that the resources used by all apps in a queue must equal the used resources of that queue. Breaking that is a more serious problem than breaking gang scheduling in the worst-case scenario.
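   
    To make that invariant concrete, here is a minimal, self-contained Go sketch of the kind of consistency check being described. This is not YuniKorn code: the `Resource`, `App` and `Queue` types below are made-up stand-ins for the real scheduler objects in pkg/scheduler/objects, reduced to a single resource dimension for illustration.
   
    ```go
    package main
    
    import "fmt"
    
    // Resource is a simplified stand-in for the scheduler's resource type,
    // tracking only one dimension for the purpose of this sketch.
    type Resource struct {
        MilliCPU int64
    }
    
    // App and Queue are hypothetical, stripped-down versions of the scheduler
    // objects; field names here are illustrative only.
    type App struct {
        ID        string
        Allocated Resource
    }
    
    type Queue struct {
        Name      string
        Allocated Resource
        Apps      []*App
    }
    
    // checkQueueConsistency verifies the invariant from the comment above:
    // the sum of the resources allocated to all apps in a queue must equal
    // the resources the queue itself tracks as allocated.
    func checkQueueConsistency(q *Queue) error {
        var sum int64
        for _, app := range q.Apps {
            sum += app.Allocated.MilliCPU
        }
        if sum != q.Allocated.MilliCPU {
            return fmt.Errorf("queue %s inconsistent: apps use %d milliCPU, queue tracks %d",
                q.Name, sum, q.Allocated.MilliCPU)
        }
        return nil
    }
    
    func main() {
        q := &Queue{
            Name:      "root.default",
            Allocated: Resource{MilliCPU: 3000},
            Apps: []*App{
                {ID: "app-1", Allocated: Resource{MilliCPU: 2000}},
                {ID: "app-2", Allocated: Resource{MilliCPU: 1000}},
            },
        }
        if err := checkQueueConsistency(q); err != nil {
            fmt.Println(err)
            return
        }
        fmt.Println("queue accounting is consistent")
    }
    ```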




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-core] yangwwei commented on a change in pull request #236: [YUNIKORN-476] placeholder allocation and swap

Posted by GitBox <gi...@apache.org>.
yangwwei commented on a change in pull request #236:
URL: https://github.com/apache/incubator-yunikorn-core/pull/236#discussion_r556193890



##########
File path: pkg/scheduler/context.go
##########
@@ -111,38 +111,22 @@ func (cc *ClusterContext) schedule() {
 		if psc.isStopped() {
 			continue
 		}
-		// try reservations first
-		alloc := psc.tryReservedAllocate()
-		// nothing reserved that can be allocated try normal allocate
+		// placeholder replacement first
+		alloc := psc.tryPlaceholderAllocate()

Review comment:
       With the current logic, `TryPlaceholderAllocate()` can also allocate a pod on a node that has no placeholder for it.
   If we run this first, it is possible that the app with the gang gets resources before the reservations (handled in `TryReservedAllocate()`) do. Should we still call `TryReservedAllocate()` first?
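   
    Purely to illustrate the ordering question (this is not code from the PR; the interface and type names below are made up), a simplified schedule pass that keeps reservations ahead of placeholder swaps could look like this:
   
    ```go
    package main
    
    import "fmt"
    
    // allocator is a hypothetical stand-in for a partition's allocation phases;
    // the real methods live on the partition context in the scheduler core.
    type allocator interface {
        tryReservedAllocate() string
        tryPlaceholderAllocate() string
        tryAllocate() string
    }
    
    // scheduleOnce shows the ordering argued for above: honour existing
    // reservations first, then placeholder swaps, then regular allocations.
    func scheduleOnce(a allocator) string {
        if alloc := a.tryReservedAllocate(); alloc != "" {
            return alloc
        }
        if alloc := a.tryPlaceholderAllocate(); alloc != "" {
            return alloc
        }
        return a.tryAllocate()
    }
    
    // fakePartition is a toy implementation used only to exercise scheduleOnce.
    type fakePartition struct{ reserved, placeholder, regular string }
    
    func (f fakePartition) tryReservedAllocate() string    { return f.reserved }
    func (f fakePartition) tryPlaceholderAllocate() string { return f.placeholder }
    func (f fakePartition) tryAllocate() string            { return f.regular }
    
    func main() {
        p := fakePartition{reserved: "", placeholder: "swap ph-1 -> alloc-1", regular: "alloc-2"}
        // With no reservation outstanding, the placeholder swap wins this pass.
        fmt.Println(scheduleOnce(p))
    }
    ```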




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org