Posted to user@flink.apache.org by Barisa Obradovic <bb...@gmail.com> on 2021/02/10 15:19:05 UTC

Should flink job manager crash during zookeeper upgrade?

I'm trying to understand whether the behaviour of the Flink JobManager during
a ZooKeeper upgrade is expected or not.

I'm running Flink 1.11.2 in Kubernetes, with ZooKeeper server 3.5.4-beta.
While I'm doing the ZooKeeper upgrade, there is a 20-second ZooKeeper downtime.
I'd expect either the Flink job to restart or a few warnings in the logs during
those 20 seconds. Instead, I see the whole Flink JVM crash (and later the pod
restart).

I expected Flink to retry ZooKeeper requests internally, so I'm
surprised it crashes. Is this expected, or is it a bug?

From the logs

org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
[09-Feb-2021 11:30:00.197 UTC] INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Opening socket connection to server zdzk.servicexxx/192.168.190.92:2181
[09-Feb-2021 11:30:00.197 UTC] INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Socket connection established to zdzk.servicexxx/192.168.190.92:2181,
initiating session
[09-Feb-2021 11:30:00.198 UTC] WARN
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Session 0x3012b0057140004 for server zdzk.servicexxx/192.168.190.92:2181,
unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_192]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
~[?:1.8.0_192]
    at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
[09-Feb-2021 11:30:02.294 UTC] INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Opening socket connection to server zdzk.servicexxx/192.168.190.92:2181
[09-Feb-2021 11:30:02.295 UTC] INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Socket connection established to zdzk.servicexxx/192.168.190.92:2181,
initiating session
[09-Feb-2021 11:30:02.295 UTC] WARN
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Session 0x3012b0057140004 for server zdzk.servicexxx/192.168.190.92:2181,
unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_192]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
~[?:1.8.0_192]
    at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
[09-Feb-2021 11:30:03.841 UTC] INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Opening socket connection to server zdzk.servicexxx/192.168.190.92:2181
[09-Feb-2021 11:30:03.842 UTC] INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Socket connection established to zdzk.servicexxx/192.168.190.92:2181,
initiating session
[09-Feb-2021 11:30:03.842 UTC] WARN
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Session 0x3012b0057140004 for server zdzk.servicexxx/192.168.190.92:2181,
unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.rea



FYI: I've asked the same question on Stack Overflow:
https://stackoverflow.com/questions/66120905/should-flink-job-manager-crash-during-zookeeper-upgrade



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Should flink job manager crash during zookeeper upgrade?

Posted by Barisa Obradovic <bb...@gmail.com>.
Thank you, Till, that's perfect.
I increased the max retry attempts a bit, and now it works like a charm (no
restarts).

Re: Should flink job manager crash during zookeeper upgrade?

Posted by Till Rohrmann <tr...@apache.org>.
Hi Barisa,

Could you give us the full logs of the run? It looks a bit like you
exceeded the maximum retry attempts while you upgraded your ZooKeeper
cluster. You can increase it via recovery.zookeeper.client.retry-wait
and recovery.zookeeper.client.max-retry-attempts.

From Flink's perspective it is intended that the system fails after some
time when it cannot connect to the ZooKeeper cluster.

Cheers,
Till
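
For reference, the two settings Till mentions could be set in flink-conf.yaml
roughly as sketched below. The values are illustrative assumptions only; the
thread does not say which values were actually used. Recent Flink
documentation lists these options under the high-availability.zookeeper.client
prefix, and the recovery.zookeeper.client names quoted above appear to be
older aliases of the same options, so check the configuration reference for
your Flink version before copying this.

```yaml
# Sketch only: example values, not the ones used in this thread.

# Wait between consecutive ZooKeeper client retries, in milliseconds.
# (Named recovery.zookeeper.client.retry-wait in the reply above.)
high-availability.zookeeper.client.retry-wait: 5000

# Number of retry attempts before the client gives up and Flink fails.
# (Named recovery.zookeeper.client.max-retry-attempts in the reply above.)
high-availability.zookeeper.client.max-retry-attempts: 10
```

The idea is to pick values whose combined retry budget comfortably exceeds the
expected ZooKeeper downtime (about 20 seconds in this case), while still
letting Flink fail eventually if ZooKeeper stays unreachable.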


Re: Should flink job manager crash during zookeeper upgrade?

Posted by Barisa Obradovic <bb...@gmail.com>.
Great, thank you for the help, Matthias.

Re: Should flink job manager crash during zookeeper upgrade?

Posted by Matthias Pohl <ma...@ververica.com>.
Hi Barisa,
thanks for sharing this. I'm gonna add Till to this thread. He might have
some insights.

Best,
Matthias
