Posted to user@flink.apache.org by Dmitry Minaev <mi...@gmail.com> on 2018/08/03 21:42:20 UTC
No activity but checkpoints are failing and backpressure is high

Hi everyone,

We have a small QA environment with just one job manager and one task
manager. Several jobs are running there, each with parallelism 1.
One of these jobs is causing a problem. During our regular upgrade process
it wasn't cancelled because the savepoint timed out:

Cancelling job 1b80efe346d437c01e17b6efda640909 with savepoint to /path/to/nfsrecovery/flink-distribution.

------------------------------------------------------------
The program finished with the following exception:

java.util.concurrent.TimeoutException: Futures timed out after [60000 milliseconds]
       at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
       at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
       at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
       at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
       at scala.concurrent.Await$.result(package.scala:190)
       at scala.concurrent.Await.result(package.scala)
       at org.apache.flink.client.program.ClusterClient.cancelWithSavepoint(ClusterClient.java:621)
       at org.apache.flink.client.CliFrontend.cancel(CliFrontend.java:628)
       at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1060)
       at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
       at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
       at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
       at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)

So we ended up with two similar jobs running in parallel (not sure whether
that's related to the problem).

There is no activity in this environment now, but I'm seeing high
backpressure on one of the operators of this job. Also, all checkpoints
for this particular job are failing with a timeout (5 minutes).
The other jobs are all fine.

I've looked at the job manager logs and noticed that once a day we have a
connection issue between JM and TM nodes:

01 Aug 2018 22:07:18,613 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: qafdsflinkw811.nn.five9lab.com/10.5.61.124:41651
01 Aug 2018 22:07:18,613 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@qafdsflinkw811.nn.five9lab.com:41651] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@qafdsflinkw811.nn.five9lab.com:41651]] Caused by: [Connection refused: qafdsflinkw811.nn.five9lab.com/10.5.61.124:41651]
02 Aug 2018 22:07:18,700 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: qafdsflinkw811.nn.five9lab.com/10.5.61.124:36539
02 Aug 2018 22:07:18,700 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@qafdsflinkw811.nn.five9lab.com:36539] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@qafdsflinkw811.nn.five9lab.com:36539]] Caused by: [Connection refused: qafdsflinkw811.nn.five9lab.com/10.5.61.124:36539]
02 Aug 2018 22:07:23,502 WARN akka.remote.Remoting - Association to [akka.tcp://flink@qafdsflinkw811.nn.five9lab.com:42579] with unknown UID is irrecoverably failed. Address cannot be quarantined without knowing the UID, gating instead for 5000 ms.
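
To check the once-a-day pattern over a longer stretch of the job manager
log, the NettyTransport warnings can be grouped by date. This is only a
rough sketch: it assumes the log format shown in the excerpt above, and the
names REFUSED, refused_by_day and SAMPLE are made up for illustration (they
are not part of Flink or Akka).

```python
import re
from collections import defaultdict

# Assumed format: the NettyTransport WARN entries as they appear in the
# excerpt above ("dd Mon yyyy HH:MM:SS,mmm WARN <logger> - <message>").
REFUSED = re.compile(
    r"(\d{2} \w{3} \d{4}) (\d{2}:\d{2}:\d{2}),\d{3} WARN "
    r"akka\.remote\.transport\.netty\.NettyTransport - .*?"
    r"Connection refused: (\S+?)/([\d.]+):(\d+)"
)


def refused_by_day(log_text):
    """Group 'Connection refused' events as {date: [(time, port), ...]}."""
    # Collapse line wraps so an entry split across lines still matches.
    flat = " ".join(log_text.split())
    events = defaultdict(list)
    for date, time, _host, _ip, port in REFUSED.findall(flat):
        events[date].append((time, int(port)))
    return dict(events)


# Two entries copied from the excerpt above, wrapped as in the email.
SAMPLE = """\
01 Aug 2018 22:07:18,613 WARN akka.remote.transport.netty.NettyTransport -
Remote connection to [null] failed with java.net.ConnectException:
Connection refused: qafdsflinkw811.nn.five9lab.com/10.5.61.124:41651
02 Aug 2018 22:07:18,700 WARN akka.remote.transport.netty.NettyTransport -
Remote connection to [null] failed with java.net.ConnectException:
Connection refused: qafdsflinkw811.nn.five9lab.com/10.5.61.124:36539
"""

if __name__ == "__main__":
    for day, hits in refused_by_day(SAMPLE).items():
        print(day, hits)
```

In the excerpt above, both refused connections happen at 22:07:18 on
consecutive days, each time on a different port.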

Other than that I don't see anything strange in the logs.

Here is a memory dump of the task manager, in case it helps:
https://drive.google.com/file/d/1T9FqY8faWHmJOPdMC0MunxxFRQbAVDjd/view?usp=sharing

I would very much appreciate any advice to help me solve the problem.

Thank you,
Dmitry Minaev