Posted to user@flink.apache.org by Alexis Sarda-Espinosa <sa...@gmail.com> on 2023/04/21 21:49:22 UTC

Kubernetes operator stops responding due to Connection reset by peer

Hello,

Today, we received an alert because the operator appeared to be down. Upon
further investigation, we realized the alert was triggered because the
endpoint for Prometheus metrics (which we enabled) had stopped responding.
The endpoint used for the liveness probe apparently wasn't affected, so the
pod was not restarted automatically.

The logs right before the problem started don't show anything odd, and once
the problem started, the logs were spammed with warning messages stating
"Connection reset by peer" with no further information. From what I can
see, nothing else was logged during that time, so it looks like the process
really had stalled.

I imagine this is not easy to reproduce and, while a pod restart was enough
to get back on track, it might be worth improving the liveness probe to
catch these situations.
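
As a rough illustration of a check that would catch this kind of stall, the
sketch below treats an HTTP endpoint as healthy only if it answers with a 2xx
status within a timeout. It is not the operator's own probe: the function name
`endpoint_responds` and the default timeout are assumptions made up for this
example, and the URL to check depends on how the metrics reporter is
configured.

```python
import http.client
import urllib.error
import urllib.request


def endpoint_responds(url: str, timeout_s: float = 2.0) -> bool:
    """Return True only if `url` answers with HTTP 2xx within `timeout_s` seconds.

    Hypothetical helper for illustration; not part of the Flink operator.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused or reset connections, timeouts, malformed responses and
        # HTTP error statuses all count as "not responding".
        return False
```

A script built around this function could exit non-zero when it returns
False, which is the convention a Kubernetes exec-style liveness probe relies
on to restart the container.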

Full stacktrace for reference:

An exceptionCaught() event was fired, and it reached at the tail of the
pipeline. It usually means the last handler in the pipeline did not handle
the exception.
java.io.IOException: Connection reset by peer
    at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at java.base/sun.nio.ch.SocketDispatcher.read(Unknown Source)
    at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
    at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
    at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
    at java.base/sun.nio.ch.SocketChannelImpl.read(Unknown Source)
    at org.apache.flink.shaded.netty4.io.netty.buffer.UnpooledDirectByteBuf.setBytes(UnpooledDirectByteBuf.java:570)
    at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
    at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Unknown Source)

Regards,
Alexis.

Re: Kubernetes operator stops responding due to Connection reset by peer

Posted by Gyula Fóra <gy...@gmail.com>.

Hi Alexis,

We have recently added support for canary deployments which allows the
liveness probe to detect general operator problems.

https://issues.apache.org/jira/browse/FLINK-31219

It's not completely automatic and you have to deploy the canaries yourself
but I think it will be helpful :)
This will be part of the upcoming 1.5.0 release.

Cheers,
Gyula
