You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@yunikorn.apache.org by GitBox <gi...@apache.org> on 2022/12/01 16:41:46 UTC

[GitHub] [yunikorn-core] bgrams opened a new pull request, #463: [YUNIKORN-1440] Cleanup terminated applications once expired

bgrams opened a new pull request, #463:
URL: https://github.com/apache/yunikorn-core/pull/463

   ### What is this PR for?
   A few sentences describing the overall goals of the pull request's commits.
   First time? Check out the contributing guide - http://yunikorn.apache.org/community/how_to_contribute   
   
   Allows terminated applications to be cleaned up once marked as expired. The `completedApplications` map otherwise grows indefinitely and may lead to memory pressure.
   
   ### What type of PR is it?
   * [x] - Bug Fix
   * [ ] - Improvement
   * [ ] - Feature
   * [ ] - Documentation
   * [ ] - Hot Fix
   * [ ] - Refactoring
   
   ### Todos
   * [ ] - Task
   
   ### What is the Jira issue?
   * Open an issue on Jira https://issues.apache.org/jira/browse/YUNIKORN/
   * Put link here, and add [YUNIKORN-*Jira number*] in PR title, eg. `[YUNIKORN-2] Gang scheduling interface parameters`
   https://issues.apache.org/jira/browse/YUNIKORN-1440
   
   ### How should this be tested?
   Unit test added
   
   ### Screenshots (if appropriate)
   
   ### Questions:
   * [ ] - The licenses files need update.
   * [ ] - There is breaking changes for older versions.
   * [ ] - It needs documentation.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [yunikorn-core] bgrams commented on pull request #463: [YUNIKORN-1440] Cleanup terminated applications once expired

Posted by GitBox <gi...@apache.org>.
bgrams commented on PR #463:
URL: https://github.com/apache/yunikorn-core/pull/463#issuecomment-1334780956

   > Looking at the TestCleanupCompletedApps it is even more broken than what you have fixed.
   > 
   > The proper test should have two apps in the `completedApplications` map. One is in the Completed state, and the second one is in the Expired state. Running the cleanupExpiredApps() should remove one and not the other. Both apps should have moved from Completing to Completed using the event. The `partition.applications` list should be 2 before the events and 0 after the events. Vice versa for the `partition.completedApplications` list
   
   Addressed in [26e126a](https://github.com/apache/yunikorn-core/pull/463/commits/26e126a649331bac065b146e3bd08c9793e7f8ac)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [yunikorn-core] codecov[bot] commented on pull request #463: [YUNIKORN-1440] Cleanup terminated applications once expired

Posted by GitBox <gi...@apache.org>.
codecov[bot] commented on PR #463:
URL: https://github.com/apache/yunikorn-core/pull/463#issuecomment-1334283123

   # [Codecov](https://codecov.io/gh/apache/yunikorn-core/pull/463?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#463](https://codecov.io/gh/apache/yunikorn-core/pull/463?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (acf9226) into [master](https://codecov.io/gh/apache/yunikorn-core/commit/9b90e8ad9f834868971b36cd69284a03532de999?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9b90e8a) will **decrease** coverage by `0.09%`.
   > The diff coverage is `100.00%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #463      +/-   ##
   ==========================================
   - Coverage   72.81%   72.71%   -0.10%     
   ==========================================
     Files          67       67              
     Lines        9990    10017      +27     
   ==========================================
   + Hits         7274     7284      +10     
   - Misses       2474     2486      +12     
   - Partials      242      247       +5     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/yunikorn-core/pull/463?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [pkg/scheduler/partition.go](https://codecov.io/gh/apache/yunikorn-core/pull/463/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci9wYXJ0aXRpb24uZ28=) | `74.60% <100.00%> (-0.61%)` | :arrow_down: |
   | [pkg/scheduler/ugm/queue\_tracker.go](https://codecov.io/gh/apache/yunikorn-core/pull/463/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci91Z20vcXVldWVfdHJhY2tlci5nbw==) | `82.00% <0.00%> (-2.85%)` | :arrow_down: |
   | [pkg/webservice/handlers.go](https://codecov.io/gh/apache/yunikorn-core/pull/463/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3dlYnNlcnZpY2UvaGFuZGxlcnMuZ28=) | `78.34% <0.00%> (-0.29%)` | :arrow_down: |
   | [pkg/scheduler/objects/simplepreemptor.go](https://codecov.io/gh/apache/yunikorn-core/pull/463/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL3NpbXBsZXByZWVtcHRvci5nbw==) | | |
   | [pkg/scheduler/objects/required\_node\_preemptor.go](https://codecov.io/gh/apache/yunikorn-core/pull/463/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL3JlcXVpcmVkX25vZGVfcHJlZW1wdG9yLmdv) | `98.83% <0.00%> (ø)` | |
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [yunikorn-core] bgrams commented on a diff in pull request #463: [YUNIKORN-1440] Cleanup terminated applications once expired

Posted by GitBox <gi...@apache.org>.
bgrams commented on code in PR #463:
URL: https://github.com/apache/yunikorn-core/pull/463#discussion_r1037797696


##########
pkg/scheduler/partition.go:
##########
@@ -1433,6 +1433,13 @@ func (pc *PartitionContext) cleanupExpiredApps() {
 		delete(pc.rejectedApplications, app.ApplicationID)
 		pc.Unlock()
 	}
+	for k, app := range pc.completedApplications {
+		pc.Lock()
+		if app.IsExpired() {
+			delete(pc.completedApplications, k)
+		}
+		pc.Unlock()
+	}

Review Comment:
   Thanks Wilfred, addressed in [209b0dc](https://github.com/apache/yunikorn-core/pull/463/commits/209b0dc7d4822d3e8cea3d3571eb13121ba39d17)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [yunikorn-core] bgrams commented on a diff in pull request #463: [YUNIKORN-1440] Cleanup terminated applications once expired

Posted by GitBox <gi...@apache.org>.
bgrams commented on code in PR #463:
URL: https://github.com/apache/yunikorn-core/pull/463#discussion_r1037805397


##########
pkg/scheduler/partition_test.go:
##########
@@ -1681,39 +1698,62 @@ func TestCompleteApp(t *testing.T) {
 	assert.Assert(t, len(partition.applications) == 1, "the partition should have 1 app")
 	assert.Assert(t, len(partition.completedApplications) == 0, "the partition should not have any completed apps")
 	// complete the application
-	err = app.HandleApplicationEvent(objects.CompleteApplication)
-	assert.NilError(t, err, "no error expected while transitioning the app from waiting to completed state")
-	err = common.WaitFor(10*time.Millisecond, time.Duration(1000)*time.Millisecond, func() bool {
-		partition.RLock()
-		defer partition.RUnlock()
-		return len(partition.completedApplications) > 0
-	})
+	err = completeApplicationAndWait(app, partition)
 	assert.NilError(t, err, "the completed application should have been processed")
 	assert.Assert(t, len(partition.applications) == 0, "the partition should have no active app")
 	assert.Assert(t, len(partition.completedApplications) == 1, "the partition should have 1 completed app")
 }
 
+func TestCleanupApp(t *testing.T) {
+	partition, err := newBasePartition()
+	assert.NilError(t, err, "partition create failed")
+	newApp1 := newApplication("newApp1", "default", defQueue)
+	newApp2 := newApplication("newApp2", "default", defQueue)
+
+	err = partition.AddApplication(newApp1)
+	assert.NilError(t, err, "no error expected while adding the app")
+	err = partition.AddApplication(newApp2)
+	assert.NilError(t, err, "no error expected while adding the app")
+
+	assert.Assert(t, len(partition.applications) == 2, "the partition should have 2 apps")
+
+	newApp1.SetState(objects.Expired.String())
+	partition.cleanupExpiredApps()
+
+	assert.Assert(t, len(partition.applications) == 1, "the partition should have 1 app")
+	assert.Assert(t, len(partition.GetAppsByState(objects.Expired.String())) == 0, "the partition should have 0 expired apps")
+}
+
 func TestCleanupCompletedApps(t *testing.T) {
 	partition, err := newBasePartition()
 	assert.NilError(t, err, "partition create failed")
-	completedApp := newApplication("completed", "default", defQueue)
-	completedApp.SetState(objects.Completed.String())
+	completedApp1 := newApplication("completedApp1", "default", defQueue)
+	completedApp2 := newApplication("completedApp2", "default", defQueue)
 
-	newApp := newApplication("running", "default", defQueue)
-	err = partition.AddApplication(completedApp)
-	assert.NilError(t, err, "no error expected while adding the application")
-	err = partition.AddApplication(newApp)
-	assert.NilError(t, err, "no error expected while adding the application")
+	err = partition.AddApplication(completedApp1)
+	assert.NilError(t, err, "no error expected while adding the app")
+	err = partition.AddApplication(completedApp2)
+	assert.NilError(t, err, "no error expected while adding the app")
 
 	assert.Assert(t, len(partition.applications) == 2, "the partition should have 2 apps")
+	assert.Assert(t, len(partition.completedApplications) == 0, "the partition should have 0 completed apps")
+
+	// complete the applications using the event system
+	completedApp1.SetState(objects.Completing.String())
+	err = completeApplicationAndWait(completedApp1, partition)
+	assert.NilError(t, err, "the completed application should have been processed")
+	completedApp2.SetState(objects.Completing.String())
+	err = completeApplicationAndWait(completedApp2, partition)
+	assert.NilError(t, err, "the completed application should have been processed")
+
+	assert.Assert(t, len(partition.applications) == 0, "the partition should have 0 apps")
+	assert.Assert(t, len(partition.completedApplications) == 2, "the partition should have 2 completed apps")
+
 	// mark the app for removal
-	completedApp.SetState(objects.Expired.String())
+	completedApp1.SetState(objects.Expired.String())
 	partition.cleanupExpiredApps()
-	assert.Assert(t, len(partition.applications) == 1, "the partition should have 1 app")
-	assert.Assert(t, partition.getApplication(completedApp.ApplicationID) == nil, "completed application should have been deleted")
-	assert.Assert(t, partition.getApplication(newApp.ApplicationID) != nil, "new application should still be in the partition")
-	assert.Assert(t, len(partition.GetAppsByState(objects.Completed.String())) == 0, "the partition should have 0 completed app")
-	assert.Assert(t, len(partition.GetAppsByState(objects.Expired.String())) == 0, "the partition should have 0 expired app")

Review Comment:
   Noting that the removal of these is covered by the check in [L.1749](https://github.com/apache/yunikorn-core/pull/463/files#diff-2edfc3fe8b3fca0a3a2a8a3924a0c5a078e3f8e9046748067d65248b458aae6aR1749) as well as the refactor of [`TestCleanupApp`](https://github.com/apache/yunikorn-core/pull/463/files#diff-2edfc3fe8b3fca0a3a2a8a3924a0c5a078e3f8e9046748067d65248b458aae6aR1707) with a similar test structure. I'm open to feedback on this refactor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [yunikorn-core] wilfred-s commented on a diff in pull request #463: [YUNIKORN-1440] Cleanup terminated applications once expired

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on code in PR #463:
URL: https://github.com/apache/yunikorn-core/pull/463#discussion_r1038166217


##########
pkg/scheduler/partition.go:
##########
@@ -1433,6 +1445,13 @@ func (pc *PartitionContext) cleanupExpiredApps() {
 		delete(pc.rejectedApplications, app.ApplicationID)
 		pc.Unlock()
 	}
+	for k, app := range pc.GetCompletedAppsByState(objects.Expired.String()) {
+		pc.Lock()
+		if app.IsExpired() {

Review Comment:
   the list is already filtered no need to check again.



##########
pkg/scheduler/partition.go:
##########
@@ -1433,6 +1445,13 @@ func (pc *PartitionContext) cleanupExpiredApps() {
 		delete(pc.rejectedApplications, app.ApplicationID)
 		pc.Unlock()
 	}
+	for k, app := range pc.GetCompletedAppsByState(objects.Expired.String()) {

Review Comment:
   you are not using app, should be a _



##########
pkg/scheduler/partition_test.go:
##########
@@ -1681,39 +1698,62 @@ func TestCompleteApp(t *testing.T) {
 	assert.Assert(t, len(partition.applications) == 1, "the partition should have 1 app")
 	assert.Assert(t, len(partition.completedApplications) == 0, "the partition should not have any completed apps")
 	// complete the application
-	err = app.HandleApplicationEvent(objects.CompleteApplication)
-	assert.NilError(t, err, "no error expected while transitioning the app from waiting to completed state")
-	err = common.WaitFor(10*time.Millisecond, time.Duration(1000)*time.Millisecond, func() bool {
-		partition.RLock()
-		defer partition.RUnlock()
-		return len(partition.completedApplications) > 0
-	})
+	err = completeApplicationAndWait(app, partition)
 	assert.NilError(t, err, "the completed application should have been processed")
 	assert.Assert(t, len(partition.applications) == 0, "the partition should have no active app")
 	assert.Assert(t, len(partition.completedApplications) == 1, "the partition should have 1 completed app")
 }
 
+func TestCleanupApp(t *testing.T) {

Review Comment:
   This should be called `TestCleanupFailedApps()` Those are the only ones that could be in the applications map with an expired state.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [yunikorn-core] wilfred-s commented on a diff in pull request #463: [YUNIKORN-1440] Cleanup terminated applications once expired

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on code in PR #463:
URL: https://github.com/apache/yunikorn-core/pull/463#discussion_r1037754703


##########
pkg/scheduler/partition.go:
##########
@@ -1433,6 +1433,13 @@ func (pc *PartitionContext) cleanupExpiredApps() {
 		delete(pc.rejectedApplications, app.ApplicationID)
 		pc.Unlock()
 	}
+	for k, app := range pc.completedApplications {
+		pc.Lock()
+		if app.IsExpired() {
+			delete(pc.completedApplications, k)
+		}
+		pc.Unlock()
+	}

Review Comment:
   This will cause a data race. The completedApplications map can be manipulated from a different go routine. Maps are not go routine safe. You will need to introduce a `pc.GetCompletedAppsByState()` method to do the filtering under a lock like we do for rejected apps.
   The delete will still need to be within the lock.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [yunikorn-core] craigcondit closed pull request #463: [YUNIKORN-1440] Cleanup terminated applications once expired

Posted by GitBox <gi...@apache.org>.
craigcondit closed pull request #463: [YUNIKORN-1440] Cleanup terminated applications once expired
URL: https://github.com/apache/yunikorn-core/pull/463


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org