Posted to issues@flink.apache.org by "Matt Dailey (JIRA)" <ji...@apache.org> on 2019/06/20 11:56:00 UTC
[jira] [Commented] (FLINK-12385) RestClusterClient can hang indefinitely during job submission

    [ https://issues.apache.org/jira/browse/FLINK-12385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868467#comment-16868467 ] 

Matt Dailey commented on FLINK-12385:
-------------------------------------

I was not able to get JobManager debug logs from when the problem occurred, but I think we did find what caused it in our environment.

We were rolling out Istio on Kubernetes, and our best guess is that the client hung while communicating with ZooKeeper. We had accidentally defined two Kubernetes services for ZooKeeper, which Istio did not handle well, and we had seen similar problems where clients would hang when connecting to services defined that way.

And that's right, this was in detached mode.

And thanks for the explanation. I think you're right: the underlying connection should hit its timeout and retry limits and exit from the future, so adding a timeout to the future is probably not the right solution.
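
For reference, here is a minimal sketch of the bounded wait proposed in the issue description, using CompletableFuture.get(timeout, unit) from the JDK instead of the untimed get(). The class and method names are hypothetical and not part of Flink, and the never-completed future only stands in for a submission response that does not arrive. As discussed, a timeout like this would bound the wait on the client but would not address a hang in the underlying connection.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Flink: bounds the wait on a future.
class SubmissionTimeoutSketch {

    // Returns the result if the future completes in time, or Optional.empty() on timeout.
    // Note: a future that completes with null also yields Optional.empty().
    static <T> Optional<T> awaitWithTimeout(CompletableFuture<T> future, long timeout, TimeUnit unit) {
        try {
            // get(timeout, unit) throws TimeoutException instead of parking forever like get().
            return Optional.ofNullable(future.get(timeout, unit));
        } catch (TimeoutException e) {
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            throw new IllegalStateException("Interrupted while waiting", e);
        } catch (ExecutionException e) {
            // Surface the original failure as an unchecked exception.
            throw new CompletionException(e.getCause());
        }
    }

    public static void main(String[] args) {
        // Stand-in for a submission response that never arrives.
        CompletableFuture<String> neverCompleted = new CompletableFuture<>();
        Optional<String> result = awaitWithTimeout(neverCompleted, 100, TimeUnit.MILLISECONDS);
        System.out.println(result.isPresent() ? "completed: " + result.get() : "timed out");
    }
}
```

The caller still has to decide what a timeout means: in the situation described in the issue, the job submission itself could have succeeded even though the client never saw the response.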

> RestClusterClient can hang indefinitely during job submission
> -------------------------------------------------------------
>
>                 Key: FLINK-12385
>                 URL: https://issues.apache.org/jira/browse/FLINK-12385
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / REST
>    Affects Versions: 1.8.0
>            Reporter: Matt Dailey
>            Priority: Minor
>
> We have had situations where clients would hang indefinitely during job submission, even when job submission would succeed. We have not yet characterized what happened on the server to cause this, but we thought that the client should have a timeout for these requests.
> This was observed in Flink 1.5.5, but the code seems to still have this problem in 1.8.0. One option is to include a timeout in calls to {{CompletableFuture.get()}}:
>  * [RestClusterClient in 1.5.5|https://github.com/apache/flink/blob/release-1.5.5/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java#L246]
>  * [RestClusterClient in 1.8.0|https://github.com/apache/flink/blob/release-1.8.0/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java#L247]
> Thread dump from client running Flink 1.5.5, running in Java 8:
> {noformat}
> "http-nio-0.0.0.0-8443-exec-6" #34 daemon prio=5 os_prio=0 tid=0x000055b421fd2000 nid=0x29 waiting on condition [0x00007f932e176000]
>    java.lang.Thread.State: WAITING (parking)
> 	at sun.misc.Unsafe.park(Native Method)
> 	- parking to wait for  <0x00000000b331d7c0> (a java.util.concurrent.CompletableFuture$Signaller)
> 	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> 	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> 	at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> 	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> 	at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:246)
> 	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
> 	at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
> 	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:410)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)