Posted to notifications@skywalking.apache.org by GitBox <gi...@apache.org> on 2022/10/19 13:12:33 UTC

[GitHub] [skywalking] kezhenxu94 opened a new issue, #9814: [Bug] Cluster coordinator should be responsive to instance up/down

kezhenxu94 opened a new issue, #9814:
URL: https://github.com/apache/skywalking/issues/9814

   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/skywalking/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### Apache SkyWalking Component
   
   OAP server (apache/skywalking)
   
   ### What happened
   
   Currently the cluster coordinator uses a polling strategy to fetch the list of OAP instances, and the polling interval is 5s:
   
   https://github.com/apache/skywalking/blob/34cfafe398e80ca1a7299e1243827937c1a691dd/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/remote/client/RemoteClientManager.java#L96
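
   A minimal sketch of what that polling loop amounts to is shown below; the class and method names are illustrative stand-ins, not the actual code in RemoteClientManager:

   ```java
   import java.util.List;
   import java.util.concurrent.Executors;
   import java.util.concurrent.ScheduledExecutorService;
   import java.util.concurrent.TimeUnit;

   // Simplified sketch of the current behavior (illustrative names): a
   // single-threaded scheduler polls the cluster module every 5 seconds and
   // rebuilds the remote gRPC clients from whatever the query returns.
   class PollingCoordinatorSketch {
       interface ClusterNodesQuery {            // stand-in for the cluster module query
           List<String> queryRemoteNodes();     // host:port of the OAP instances seen by the backend
       }

       void start(ClusterNodesQuery query) {
           ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
           scheduler.scheduleWithFixedDelay(() -> {
               List<String> instances = query.queryRemoteNodes();
               // Recreate/close gRPC clients to match the polled list. Between two
               // polls (up to 5s) the client list is stale, so calls to instances
               // that died in the meantime fail with UNAVAILABLE as shown below.
               refreshRemoteClients(instances);
           }, 1, 5, TimeUnit.SECONDS);
       }

       void refreshRemoteClients(List<String> instances) { /* rebuild clients */ }
   }
   ```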
   
   There is a common case: when some of the OAP instances are restarted or shut down, the living ones keep trying to connect to those dead instances until the next polling cycle, causing error logs like this:
   
   ```
   2022-10-18 07:33:28,297 - org.apache.skywalking.oap.server.core.remote.client.GRPCRemoteClient -31025 [grpc-default-executor-0] ERROR [] - UNAVAILABLE: io exception
   io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
   	at io.grpc.Status.asRuntimeException(Status.java:535) ~[grpc-api-1.49.0.jar:1.49.0]
   	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487) [grpc-stub-1.49.0.jar:1.49.0]
   	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563) [grpc-core-1.49.0.jar:1.49.0]
   	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) [grpc-core-1.49.0.jar:1.49.0]
   	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:744) [grpc-core-1.49.0.jar:1.49.0]
   	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723) [grpc-core-1.49.0.jar:1.49.0]
   	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) [grpc-core-1.49.0.jar:1.49.0]
   	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) [grpc-core-1.49.0.jar:1.49.0]
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
   	at java.lang.Thread.run(Unknown Source) [?:?]
   Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.92.1.254:31800
   Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
   	at io.netty.channel.unix.Errors.newConnectException0(Errors.java:155) ~[netty-transport-native-unix-common-4.1.81.Final-linux-x86_64.jar:4.1.81.Final]
   	at io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128) ~[netty-transport-native-unix-common-4.1.81.Final-linux-x86_64.jar:4.1.81.Final]
   	at io.netty.channel.unix.Socket.finishConnect(Socket.java:359) ~[netty-transport-native-unix-common-4.1.81.Final-linux-x86_64.jar:4.1.81.Final]
   	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
   	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
   	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
   	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:489) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
   	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
   	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.81.Final.jar:4.1.81.Final]
   	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.81.Final.jar:4.1.81.Final]
   	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.81.Final.jar:4.1.81.Final]
   	... 1 more
   ```
   
   Most service-discovery services, such as Kubernetes, provide a watcher/listener mode: clients can register for changes to the instances and get notified right after an instance's state changes. For backends that don't support a listener mode, it is also easy to wrap the polling mechanism and expose it as a listener, as sketched below.
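
   As a rough illustration of that wrapping idea (all names here are hypothetical, not an existing SkyWalking API), a polling-based backend could be adapted into a listener by diffing consecutive poll results, while a backend with native watch support could fire the same listener directly with no interval lag:

   ```java
   import java.util.HashSet;
   import java.util.List;
   import java.util.Set;
   import java.util.concurrent.Executors;
   import java.util.concurrent.ScheduledExecutorService;
   import java.util.concurrent.TimeUnit;
   import java.util.function.Supplier;

   // Hypothetical adapter: keeps polling underneath but exposes a listener API,
   // so the coordinator reacts to node up/down events instead of re-reading the
   // full instance list on a fixed schedule.
   class PollingClusterWatcherSketch {
       interface ClusterListener {
           void onNodeUp(String address);
           void onNodeDown(String address);
       }

       private final Set<String> known = new HashSet<>();

       void watch(Supplier<List<String>> poller, ClusterListener listener, long intervalSeconds) {
           ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
           scheduler.scheduleWithFixedDelay(() -> {
               Set<String> current = new HashSet<>(poller.get());
               // Fire events only for the differences between two polls.
               for (String address : current) {
                   if (known.add(address)) {
                       listener.onNodeUp(address);
                   }
               }
               known.removeIf(address -> {
                   if (!current.contains(address)) {
                       listener.onNodeDown(address);
                       return true;
                   }
                   return false;
               });
           }, 0, intervalSeconds, TimeUnit.SECONDS);
       }
   }
   ```

   For Kubernetes, the same kind of listener could be driven directly by a pod/endpoint watch, which is what would make the coordinator responsive to instance up/down instead of waiting for the next 5s poll.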
   
   ### What you expected to happen
   
   The OAP should be more responsive to changes in the states of the cluster instances, reducing unnecessary errors.
   
   ### How to reproduce
   
   Start the OAP in Kubernetes cluster mode with more than 1 replica. After the OAP is ready, restart one of the OAP instances and observe the logs of the other living instances; there should be error logs like the ones posted above.
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@skywalking.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [skywalking] kezhenxu94 commented on issue #9814: [Bug] Cluster coordinator should be responsive to instance up/down

Posted by GitBox <gi...@apache.org>.
kezhenxu94 commented on issue #9814:
URL: https://github.com/apache/skywalking/issues/9814#issuecomment-1336140836

   Just contacted @fgksgf and he has very limited time to work on this, so I'm closing this in favor of https://github.com/apache/skywalking/issues/10076


[GitHub] [skywalking] kezhenxu94 commented on issue #9814: [Bug] Cluster coordinator should be responsive to instance up/down

Posted by GitBox <gi...@apache.org>.
kezhenxu94 commented on issue #9814:
URL: https://github.com/apache/skywalking/issues/9814#issuecomment-1283994155

   Assigning to @fgksgf as well, since I discussed this with him a long time ago.


[GitHub] [skywalking] kezhenxu94 closed issue #9814: [Bug] Cluster coordinator should be responsive to instance up/down

Posted by GitBox <gi...@apache.org>.
kezhenxu94 closed issue #9814: [Bug] Cluster coordinator should be responsive to instance up/down
URL: https://github.com/apache/skywalking/issues/9814

