You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@yunikorn.apache.org by GitBox <gi...@apache.org> on 2021/01/27 22:40:03 UTC

[GitHub] [incubator-yunikorn-k8shim] yangwwei opened a new pull request #224: [YUNIKORN-518] Placeholder manager failed to init during scheduler recovery

yangwwei opened a new pull request #224:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/224


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] wilfred-s closed pull request #224: [YUNIKORN-518] Placeholder manager failed to init during scheduler recovery

Posted by GitBox <gi...@apache.org>.
wilfred-s closed pull request #224:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/224


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] codecov[bot] edited a comment on pull request #224: [YUNIKORN-518] Placeholder manager failed to init during scheduler recovery

Posted by GitBox <gi...@apache.org>.
codecov[bot] edited a comment on pull request #224:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/224#issuecomment-768904135


   # [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=h1) Report
   > Merging [#224](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=desc) (750264d) into [master](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/commit/79044e6a6823eb6f07c5c8406c4df63b65bd0843?el=desc) (79044e6) will **increase** coverage by `0.03%`.
   > The diff coverage is `66.66%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224/graphs/tree.svg?width=650&height=150&src=pr&token=LZImIuvleR)](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #224      +/-   ##
   ==========================================
   + Coverage   59.71%   59.75%   +0.03%     
   ==========================================
     Files          35       35              
     Lines        3135     3133       -2     
   ==========================================
     Hits         1872     1872              
   + Misses       1181     1180       -1     
   + Partials       82       81       -1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [pkg/cache/application.go](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224/diff?src=pr&el=tree#diff-cGtnL2NhY2hlL2FwcGxpY2F0aW9uLmdv) | `76.74% <0.00%> (ø)` | |
   | [pkg/cache/placeholder\_manager.go](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224/diff?src=pr&el=tree#diff-cGtnL2NhY2hlL3BsYWNlaG9sZGVyX21hbmFnZXIuZ28=) | `79.68% <ø> (+2.07%)` | :arrow_up: |
   | [pkg/shim/scheduler.go](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224/diff?src=pr&el=tree#diff-cGtnL3NoaW0vc2NoZWR1bGVyLmdv) | `79.56% <100.00%> (+0.15%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=footer). Last update [79044e6...750264d](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] wilfred-s commented on pull request #224: [YUNIKORN-518] Placeholder manager failed to init during scheduler recovery

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on pull request #224:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/224#issuecomment-768871332


   No, if we cannot work without a `PlaceholderManager` initialised we should not start up. We need tests for that.
   What you are doing now is that we crash the scheduler when, by chance, a task group is requested by an application. That might not come for minutes, hours or even days after starting up. So for no reason you could crash the scheduler after running for days without an issue.
   That is not acceptable, and a really bad user experience.
   
   I agree that the fix is to correctly and always initialise the `PlaceholderManager`. However since we have no `nil` checks in the code for `GetPlacholderManager()` we need to make sure that we take down the scheduler immediately if it is not initialised correctly. We do the same when the `AppManagementService` does not start correctly.
   Creation of the manager should be part of the `newShimScheduler()` and then in the run we need to check it (not sure if there is anything we need to do beside a simple not nil check) and start it. When it is not correct we call `ss.stop()` which  should also be nil safe to not cause a panic on shutdown.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] codecov[bot] commented on pull request #224: [YUNIKORN-518] Placeholder manager failed to init during scheduler recovery

Posted by GitBox <gi...@apache.org>.
codecov[bot] commented on pull request #224:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/224#issuecomment-768904135


   # [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=h1) Report
   > Merging [#224](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=desc) (ee6ca1f) into [master](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/commit/79044e6a6823eb6f07c5c8406c4df63b65bd0843?el=desc) (79044e6) will **not change** coverage.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224/graphs/tree.svg?width=650&height=150&src=pr&token=LZImIuvleR)](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=tree)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master     #224   +/-   ##
   =======================================
     Coverage   59.71%   59.71%           
   =======================================
     Files          35       35           
     Lines        3135     3135           
   =======================================
     Hits         1872     1872           
     Misses       1181     1181           
     Partials       82       82           
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [pkg/shim/scheduler.go](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224/diff?src=pr&el=tree#diff-cGtnL3NoaW0vc2NoZWR1bGVyLmdv) | `79.41% <100.00%> (ø)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=footer). Last update [79044e6...ee6ca1f](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] yangwwei commented on pull request #224: [YUNIKORN-518] Placeholder manager failed to init during scheduler recovery

Posted by GitBox <gi...@apache.org>.
yangwwei commented on pull request #224:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/224#issuecomment-768793771


   hi @wilfred-s the purpose of adding that FATAL error is to make sure we can crash the scheduler when placeholder manager was not initialized. And the reason we hit this issue was that the `NewPlaceholderManager()` was not called after `GetPlaceholderManager()` during recovery, so it crashed. This should never happen in production. If we remove that and we will get an NPE here, similar but with the FATAL error, we know definitely something goes wrong.
   
   We can check the return value of `NewPlaceholderManager()` but that's really not the root cause. The fix is to adjust the order of services' starting to make sure the `NewPlaceholderManager()` gets called before `GetPlaceholderManager()`.
   
   Does that make sense?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] codecov[bot] edited a comment on pull request #224: [YUNIKORN-518] Placeholder manager failed to init during scheduler recovery

Posted by GitBox <gi...@apache.org>.
codecov[bot] edited a comment on pull request #224:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/224#issuecomment-768904135


   # [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=h1) Report
   > Merging [#224](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=desc) (ee6ca1f) into [master](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/commit/79044e6a6823eb6f07c5c8406c4df63b65bd0843?el=desc) (79044e6) will **not change** coverage.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224/graphs/tree.svg?width=650&height=150&src=pr&token=LZImIuvleR)](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=tree)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master     #224   +/-   ##
   =======================================
     Coverage   59.71%   59.71%           
   =======================================
     Files          35       35           
     Lines        3135     3135           
   =======================================
     Hits         1872     1872           
     Misses       1181     1181           
     Partials       82       82           
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [pkg/shim/scheduler.go](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224/diff?src=pr&el=tree#diff-cGtnL3NoaW0vc2NoZWR1bGVyLmdv) | `79.41% <100.00%> (ø)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=footer). Last update [79044e6...750264d](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/224?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org