You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@yunikorn.apache.org by GitBox <gi...@apache.org> on 2020/06/12 05:59:47 UTC

[GitHub] [incubator-yunikorn-k8shim] yangwwei opened a new pull request #136: [YUNIKORN-220] Some nodes might be missing if the addNode event is missing

yangwwei opened a new pull request #136:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/136


   Some times we found the scheduler state is being inconsistent with K8s. Looks like some addNode was not sent by the informer. In such case, we should make sure the node can be added as well via updateNode event.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] yangwwei commented on pull request #136: [YUNIKORN-220] Some nodes might be missing if the addNode event is missing

Posted by GitBox <gi...@apache.org>.
yangwwei commented on pull request #136:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/136#issuecomment-643502368


   Hi @wangdatan 
   
   I don't think this is a K8s bug, this is how it is designed to be.
   The shared index informer communicates with api-server and caches the states locally, YK registers some listeners to the informers, so YK gets notified when there are some resource changes by add/update/delete events. Since all of these things are happening in an async way, there is no guarantee of the event ordering, or sometimes event could be missing. (e.g due to api-server throttling). The shared index informer guarantees the eventual consistency, which means the object states will be eventually updated to the correct state. That could have some delay, or that might depend on some more updates.
   
   So I think we cannot simply rely on one `add` event to add the node. If this event is not sent by the informer, we run into issues. The fix is to do the ADD if we find a new node while getting the `update` event.
   
   Short answer, I think this is a corner case we need to handle, to improve resilience. @wangdatan , can you help to review the patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] yangwwei edited a comment on pull request #136: [YUNIKORN-220] Some nodes might be missing if the addNode event is missing

Posted by GitBox <gi...@apache.org>.
yangwwei edited a comment on pull request #136:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/136#issuecomment-643502368


   Hi @wangdatan 
   
   I don't think this is a K8s bug, this is how it is designed to be.
   
   The shared index informer communicates with api-server and caches the states locally, YK registers some listeners to the informers, so YK gets notified when there are some resource changes by add/update/delete events. Since all of these things are happening in an async way, there is no guarantee of the event ordering, or sometimes event could be missing. (e.g due to api-server throttling). The shared index informer guarantees the eventual consistency, which means the object states will be eventually updated to the correct state. That could have some delay, or that might depend on some more updates.
   
   So I think we cannot simply rely on one `add` event to add the node. If this event is not sent by the informer, we run into issues. The fix is to do the ADD if we find a new node while getting the `update` event.
   
   Short answer, I think this is a corner case we need to handle, to improve resilience. @wangdatan , can you help to review the patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] wangdatan commented on pull request #136: [YUNIKORN-220] Some nodes might be missing if the addNode event is missing

Posted by GitBox <gi...@apache.org>.
wangdatan commented on pull request #136:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/136#issuecomment-643497906


   Thanks @yangwwei for troubleshooting this issue, do you know why such events can miss from K8s? Is it a bug of K8s or a normal corner case we need to handle? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] yangwwei merged pull request #136: [YUNIKORN-220] Some nodes might be missing if the addNode event is missing

Posted by GitBox <gi...@apache.org>.
yangwwei merged pull request #136:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/136


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] wangdatan commented on pull request #136: [YUNIKORN-220] Some nodes might be missing if the addNode event is missing

Posted by GitBox <gi...@apache.org>.
wangdatan commented on pull request #136:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/136#issuecomment-643837820


   @yangwwei , I checked this area again in details: based on my research, the only possible area is we hit:
   
   https://github.com/kubernetes/kubernetes/issues/83810 
   
   If I understand it correctly, the only area it could happen is a remove followed by the add immediately, and API server could send an update instead of two events.
   
   The fix in default scheduler for this particular issue is: https://github.com/kubernetes/kubernetes/pull/91126
   
   It checks UID of the object when an update received. 
   
   I also did more research about how default scheduler handles node doesn't exist when update received, like we previously discovered. default scheduler invokes addNode when such thing happens, this behavior isn't changed for the last 4 years.
   
   However, I think it is not 100% safe to assume it is the case. I. suggest to print UID of old node when we received a node update, but the name isn't present in the map. Also, in the patch, nodes.go doesn't protect `getNode` operation under `updateNode`. I think it will be risky to create race condition, at least it is not a best practice. We need to take care of it as a follow up patch
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] yangwwei commented on pull request #136: [YUNIKORN-220] Some nodes might be missing if the addNode event is missing

Posted by GitBox <gi...@apache.org>.
yangwwei commented on pull request #136:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/136#issuecomment-643511703


   > Why we have two different node cache: scheduler_cache.go, and nodes.go
   
   The scheduler_cache in the `pkg/cache/external` is the cache for predicates functions, we need that because we reuse the K8s predicates implementation.
   
   > I think it is possible that some events missed, but I couldn't find there's a bug reported for that. So I still not felt it is safe to assume "add" events will eventually become "update"
   
   Update happens as long as there is a change in the resource, e.g state changes. So it happens multiple times in the resource's lifecycle. However `add` only happens once. If you look at how this was handled in the default scheduler, it is similar: https://github.com/kubernetes/kubernetes/blob/2402bfd4bc329ae755250f5214351b1408a48eab/pkg/scheduler/internal/cache/cache.go#L589-L591.
   
   > I think the order can be guaranteed based on comments from
   
   Yeah, I guess this is only true for the same object. We saw earlier that `addPod` (daemonset pod) event arrives before `addNode` where actually on K8s the node is created first.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] wangdatan commented on pull request #136: [YUNIKORN-220] Some nodes might be missing if the addNode event is missing

Posted by GitBox <gi...@apache.org>.
wangdatan commented on pull request #136:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/136#issuecomment-643510126


   I think the order can be guaranteed based on comments from https://github.com/kubernetes/client-go/blob/master/tools/cache/shared_informer.go (I think we need to go over the comment, there're lots of contents inside and I do not fully understand the context)
   
   ```
   For a given SharedInformer, client, and object ID, the notifications are delivered in order.
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] wangdatan commented on pull request #136: [YUNIKORN-220] Some nodes might be missing if the addNode event is missing

Posted by GitBox <gi...@apache.org>.
wangdatan commented on pull request #136:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/136#issuecomment-643511313


   Found one more related fix, it looks similar but I'm not sure: https://github.com/kubernetes/kubernetes/pull/83911/
   
   ```
   Fixed an issue with informers missing an `Added` event if a recently deleted object was immediately recreated at the same time the informer dropped a watch and relisted.
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] wangdatan commented on pull request #136: [YUNIKORN-220] Some nodes might be missing if the addNode event is missing

Posted by GitBox <gi...@apache.org>.
wangdatan commented on pull request #136:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/136#issuecomment-643521048


   Thanks @yangwwei, since this is inline with the default scheduler, I think this is a correct fix, the latest patch LGTM. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] codecov-commenter commented on pull request #136: [YUNIKORN-220] Some nodes might be missing if the addNode event is missing

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #136:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/136#issuecomment-643116937


   # [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136?src=pr&el=h1) Report
   > Merging [#136](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/commit/c7c0df2103d5b70d793a104dd441d2c24222dea6&el=desc) will **increase** coverage by `0.37%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136/graphs/tree.svg?width=650&height=150&src=pr&token=LZImIuvleR)](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #136      +/-   ##
   ==========================================
   + Coverage   52.89%   53.26%   +0.37%     
   ==========================================
     Files          32       32              
     Lines        3025     3030       +5     
   ==========================================
   + Hits         1600     1614      +14     
   + Misses       1364     1355       -9     
     Partials       61       61              
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [pkg/cache/external/scheduler\_cache.go](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136/diff?src=pr&el=tree#diff-cGtnL2NhY2hlL2V4dGVybmFsL3NjaGVkdWxlcl9jYWNoZS5nbw==) | `34.21% <100.00%> (+4.95%)` | :arrow_up: |
   | [pkg/cache/nodes.go](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136/diff?src=pr&el=tree#diff-cGtnL2NhY2hlL25vZGVzLmdv) | `78.44% <100.00%> (+1.45%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136?src=pr&el=footer). Last update [c7c0df2...b27741a](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] codecov-commenter edited a comment on pull request #136: [YUNIKORN-220] Some nodes might be missing if the addNode event is missing

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #136:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/136#issuecomment-643116937


   # [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136?src=pr&el=h1) Report
   > Merging [#136](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/commit/c7c0df2103d5b70d793a104dd441d2c24222dea6&el=desc) will **increase** coverage by `0.37%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136/graphs/tree.svg?width=650&height=150&src=pr&token=LZImIuvleR)](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #136      +/-   ##
   ==========================================
   + Coverage   52.89%   53.26%   +0.37%     
   ==========================================
     Files          32       32              
     Lines        3025     3030       +5     
   ==========================================
   + Hits         1600     1614      +14     
   + Misses       1364     1355       -9     
     Partials       61       61              
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [pkg/cache/external/scheduler\_cache.go](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136/diff?src=pr&el=tree#diff-cGtnL2NhY2hlL2V4dGVybmFsL3NjaGVkdWxlcl9jYWNoZS5nbw==) | `34.21% <100.00%> (+4.95%)` | :arrow_up: |
   | [pkg/cache/nodes.go](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136/diff?src=pr&el=tree#diff-cGtnL2NhY2hlL25vZGVzLmdv) | `78.44% <100.00%> (+1.45%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136?src=pr&el=footer). Last update [c7c0df2...5e31c1f](https://codecov.io/gh/apache/incubator-yunikorn-k8shim/pull/136?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] wangdatan commented on pull request #136: [YUNIKORN-220] Some nodes might be missing if the addNode event is missing

Posted by GitBox <gi...@apache.org>.
wangdatan commented on pull request #136:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/136#issuecomment-643509601


   Thanks @yangwwei for the explanation, I'm trying to understand following areas: 
   
   1) Why we have two different node cache: scheduler_cache.go, and nodes.go? 
   
   2) I spent a bit time to look at community issues, here are a few: 
   
   https://github.com/kubernetes/kubernetes/issues/83810
   
   https://github.com/kubernetes/kubernetes/pull/91126
   
   I think it is possible that some events missed, but I couldn't find there's a bug reported for that. So I still not felt it is safe to assume "add" events will eventually become "update" 
   
   And even we assume it is true, do we need such handling for other objects like POD, namespace, or any other k8s objects we watched. I just want to make sure we won't see the same issue in the future. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] wangdatan commented on pull request #136: [YUNIKORN-220] Some nodes might be missing if the addNode event is missing

Posted by GitBox <gi...@apache.org>.
wangdatan commented on pull request #136:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/136#issuecomment-643514535


   I think this fix looks related to the issue you saw: 
   
   https://github.com/kubernetes/client-go/commit/b775e00fe53e0b30886ba5faf7885be0db390676#diff-1ff95943917d4fd6146cc18fa27b320dR194
   
   > Yeah, I guess this is only true for the same object. We saw earlier that addPod (daemonset pod) event arrives before addNode where actually on K8s the node is created first.
   
   So how we handle this case? Assume a pod is added before the node added, will it cause our scheduler to go to some limbo state? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] yangwwei commented on pull request #136: [YUNIKORN-220] Some nodes might be missing if the addNode event is missing

Posted by GitBox <gi...@apache.org>.
yangwwei commented on pull request #136:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/136#issuecomment-643516131


   > So how we handle this case? Assume a pod is added before the node added, will it cause our scheduler to go to some limbo state?
   
   YK itself is not tightly depending on the order of pods and nodes. If node is not registered yet, the pod will be pending until that node is added. So this will not cause any problems for us.
   But it was causing some problems in the external cache. Pls see the fix https://github.com/apache/incubator-yunikorn-k8shim/pull/123. That would lead up issues to run predicate functions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org