Posted to notifications@skywalking.apache.org by GitBox <gi...@apache.org> on 2022/10/19 13:12:33 UTC
[GitHub] [skywalking] kezhenxu94 opened a new issue, #9814: [Bug] Cluster coordinator should be responsive to instance up/down
kezhenxu94 opened a new issue, #9814:
URL: https://github.com/apache/skywalking/issues/9814
### Search before asking
- [X] I had searched in the [issues](https://github.com/apache/skywalking/issues?q=is%3Aissue) and found no similar issues.
### Apache SkyWalking Component
OAP server (apache/skywalking)
### What happened
Currently the cluster coordinator uses a polling strategy to fetch the OAP instances, with an interval of 5s:
https://github.com/apache/skywalking/blob/34cfafe398e80ca1a7299e1243827937c1a691dd/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/remote/client/RemoteClientManager.java#L96
A common case is that some of the OAP instances restart or shut down. Until the next poll, the living instances still try to connect to the dead ones, causing error logs like this:
```
2022-10-18 07:33:28,297 - org.apache.skywalking.oap.server.core.remote.client.GRPCRemoteClient -31025 [grpc-default-executor-0] ERROR [] - UNAVAILABLE: io exception
io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at io.grpc.Status.asRuntimeException(Status.java:535) ~[grpc-api-1.49.0.jar:1.49.0]
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487) [grpc-stub-1.49.0.jar:1.49.0]
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563) [grpc-core-1.49.0.jar:1.49.0]
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) [grpc-core-1.49.0.jar:1.49.0]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:744) [grpc-core-1.49.0.jar:1.49.0]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723) [grpc-core-1.49.0.jar:1.49.0]
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) [grpc-core-1.49.0.jar:1.49.0]
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) [grpc-core-1.49.0.jar:1.49.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.92.1.254:31800
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
at io.netty.channel.unix.Errors.newConnectException0(Errors.java:155) ~[netty-transport-native-unix-common-4.1.81.Final-linux-x86_64.jar:4.1.81.Final]
at io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128) ~[netty-transport-native-unix-common-4.1.81.Final-linux-x86_64.jar:4.1.81.Final]
at io.netty.channel.unix.Socket.finishConnect(Socket.java:359) ~[netty-transport-native-unix-common-4.1.81.Final-linux-x86_64.jar:4.1.81.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:489) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.81.Final.jar:4.1.81.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.81.Final.jar:4.1.81.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.81.Final.jar:4.1.81.Final]
... 1 more
```
Most service-discovery systems, such as Kubernetes, offer a watcher/listener mode in which clients register for instance changes and are notified right after an instance's state changes. For those that don't support a listener mode, it is also easy to wrap the polling mechanism and expose it as a listener mechanism.
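To illustrate the last point, here is a minimal sketch of wrapping a polling source so that it behaves like a listener. Everything in it is hypothetical and not part of the SkyWalking code base: the `PollingInstanceWatcher` class, the `InstanceListener` interface, and the use of plain `host:port` strings to identify instances are assumptions made for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Supplier;

/** Hypothetical adapter: exposes a polling-only instance source as a listener. */
public class PollingInstanceWatcher {
    /** Hypothetical callback, invoked only when the instance set changes. */
    public interface InstanceListener {
        void onChange(Set<String> added, Set<String> removed);
    }

    // Assumption: the poller returns the currently known instances as "host:port".
    private final Supplier<Set<String>> poller;
    private final List<InstanceListener> listeners = new CopyOnWriteArrayList<>();
    private Set<String> lastSeen = new HashSet<>();

    public PollingInstanceWatcher(Supplier<Set<String>> poller) {
        this.poller = poller;
    }

    public void addListener(InstanceListener listener) {
        listeners.add(listener);
    }

    /** Polls once, diffs against the previous result, and notifies on change. */
    public synchronized void pollOnce() {
        Set<String> current = new HashSet<>(poller.get());

        Set<String> added = new HashSet<>(current);
        added.removeAll(lastSeen);

        Set<String> removed = new HashSet<>(lastSeen);
        removed.removeAll(current);

        lastSeen = current;
        if (added.isEmpty() && removed.isEmpty()) {
            return; // nothing changed, so listeners stay quiet
        }
        for (InstanceListener listener : listeners) {
            listener.onChange(added, removed);
        }
    }
}
```

In this sketch a scheduler (for example a `ScheduledExecutorService`) would call `pollOnce()` at whatever interval the backend tolerates, while consumers only see add/remove events. A backend with native watch support could feed the same listener interface directly, without the polling loop.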
### What you expected to happen
The OAP should be more responsive to changes in the state of the cluster instances, reducing unnecessary errors.
### How to reproduce
Start OAP in Kubernetes cluster mode with the replicas set to more than 1. After the OAP is ready, restart one of the OAP instances and observe the logs of the other living OAP instances; there should be error logs like the ones posted above.
### Anything else
_No response_
### Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: notifications-unsubscribe@skywalking.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [skywalking] kezhenxu94 commented on issue #9814: [Bug] Cluster coordinator should be responsive to instance up/down
Posted by GitBox <gi...@apache.org>.
kezhenxu94 commented on issue #9814:
URL: https://github.com/apache/skywalking/issues/9814#issuecomment-1336140836
Just contacted @fgksgf, and he thinks he has very limited time to work on this, so I'm closing this in favor of https://github.com/apache/skywalking/issues/10076
[GitHub] [skywalking] kezhenxu94 commented on issue #9814: [Bug] Cluster coordinator should be responsive to instance up/down
Posted by GitBox <gi...@apache.org>.
kezhenxu94 commented on issue #9814:
URL: https://github.com/apache/skywalking/issues/9814#issuecomment-1283994155
Assigning to @fgksgf as well, since I discussed this with him a long time ago
[GitHub] [skywalking] kezhenxu94 closed issue #9814: [Bug] Cluster coordinator should be responsive to instance up/down
Posted by GitBox <gi...@apache.org>.
kezhenxu94 closed issue #9814: [Bug] Cluster coordinator should be responsive to instance up/down
URL: https://github.com/apache/skywalking/issues/9814