Posted to user@flink.apache.org by Hao Sun <ha...@zendesk.com> on 2017/11/16 22:37:11 UTC

Task manager suddenly lost connection to JM

Hi team, I am seeing a weird issue where one of my TaskManagers (TM) suddenly
loses its connection to the JobManager (JM).
Once the job running on that TM is relocated to a new TM, it can reconnect
to the JM again.
After a while, the new TM running the same job repeats the same cycle.
There is no guarantee that a troubled TM reconnects to the JM within a
reasonable time frame, such as minutes. Sometimes it takes days to
reconnect successfully.

I am using Flink 1.3.2 on Kubernetes. Could this be caused by network
congestion?
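For reference, the JM marks a TM unreachable through Akka's death-watch failure detector, which is tuned via flink-conf.yaml. A minimal sketch of the keys involved (values are illustrative, matching the Flink 1.3 defaults as I understand them, not taken from my cluster):

```yaml
# flink-conf.yaml -- Akka failure-detector keys relevant here (Flink 1.3).
# Values are illustrative defaults, not my actual configuration.

# How long JM/TM Akka asks wait before timing out.
akka.ask.timeout: 10 s

# Death-watch heartbeat interval, and the acceptable pause before the
# remote actor system (the TM) is declared unreachable.
akka.watch.heartbeat.interval: 10 s
akka.watch.heartbeat.pause: 60 s

# Phi-accrual failure-detector threshold; higher values tolerate more
# latency jitter before triggering "Detected unreachable".
akka.watch.threshold: 12
```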

Thanks!

===== Logs from JM ======

2017-11-16 19:14:40,216 WARN  akka.remote.RemoteWatcher
                       - Detected unreachable:
[akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]
2017-11-16 19:14:40,218 INFO
org.apache.flink.runtime.jobmanager.JobManager                - Task
manager akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416/user/taskmanager
terminated.
2017-11-16 19:14:40,219 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(484ebabbd13dce5e8503d88005bcdb6c) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager was lost/killed:
50cae001c1d97e55889a6051319f4746 @
fps-flink-taskmanager-701246518-rb4k6 (dataPort=43871)
	at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
	at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
	at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
	at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
	at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
	at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)
	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1131)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
	at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
	at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
	at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
	at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
	at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
	at akka.actor.ActorCell.invoke(ActorCell.scala:486)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
	at akka.dispatch.Mailbox.run(Mailbox.scala:220)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-11-16 19:14:40,219 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched
from state RUNNING to FAILING.
java.lang.Exception: TaskManager was lost/killed:
50cae001c1d97e55889a6051319f4746 @
fps-flink-taskmanager-701246518-rb4k6 (dataPort=43871)
	at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
	at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
	at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
	at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
	at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
	at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)
	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1131)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
	at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
	at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
	at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
	at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
	at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
	at akka.actor.ActorCell.invoke(ActorCell.scala:486)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
	at akka.dispatch.Mailbox.run(Mailbox.scala:220)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-11-16 19:14:40,227 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Try to
restart or fail the job KafkaDemo (env:production)
(3de8f35a1af689237e3c5c94023aba3f) if no longer possible.
2017-11-16 19:14:40,227 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched
from state FAILING to RESTARTING.
2017-11-16 19:14:40,227 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Restarting the job KafkaDemo (env:production)
(3de8f35a1af689237e3c5c94023aba3f).
2017-11-16 19:14:40,229 INFO
org.apache.flink.runtime.instance.InstanceManager             -
Unregistered task manager
fps-flink-taskmanager-701246518-rb4k6/10.225.131.3. Number of
registered task managers 3. Number of available slots 3.
2017-11-16 19:14:40,301 WARN  akka.remote.ReliableDeliverySupervisor
                     - Association with remote system
[akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416] has
failed, address is now gated for [5000] ms. Reason: [Association
failed with [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]]
Caused by: [fps-flink-taskmanager-701246518-rb4k6: Name does not
resolve]
2017-11-16 19:14:50,228 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched
from state RESTARTING to CREATED.
2017-11-16 19:14:50,228 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     -
Restoring from latest valid checkpoint: Checkpoint 646 @ 1510859465870
for 3de8f35a1af689237e3c5c94023aba3f.
2017-11-16 19:14:50,233 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - No
master state to restore
2017-11-16 19:14:50,233 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched
from state CREATED to RUNNING.
2017-11-16 19:14:50,233 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(6ae48dd06a2a48dce68a3ea83c21a462) switched from CREATED to SCHEDULED.
2017-11-16 19:14:50,234 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(6ae48dd06a2a48dce68a3ea83c21a462) switched from SCHEDULED to
DEPLOYING.
2017-11-16 19:14:50,234 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Deploying Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (attempt #1) to
fps-flink-taskmanager-701246518-v9vjx
2017-11-16 19:14:50,244 WARN  akka.remote.ReliableDeliverySupervisor
                     - Association with remote system
[akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416] has
failed, address is now gated for [5000] ms. Reason: [Association
failed with [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]]
Caused by: [fps-flink-taskmanager-701246518-rb4k6]
2017-11-16 19:14:50,987 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(6ae48dd06a2a48dce68a3ea83c21a462) switched from DEPLOYING to RUNNING.
2017-11-16 19:15:24,480 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     -
Triggering checkpoint 649 @ 1510859724479
2017-11-16 19:15:25,638 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     -
Completed checkpoint 649 (6724 bytes in 928 ms).
2017-11-16 19:16:22,205 WARN  Remoting
                     - Tried to associate with unreachable remote
address [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416].
Address is now gated for 5000 ms, all messages to this address will be
delivered to dead letters. Reason: [The remote system has quarantined
this system. No further associations to the remote system are possible
until this system is restarted.]
2017-11-16 19:16:50,233 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     -
Triggering checkpoint 648 @ 1510859810233
2017-11-16 19:17:12,972 INFO
org.apache.flink.runtime.jobmanager.JobManager                -
Received message for non-existing checkpoint 647
2017-11-16 19:17:13,685 INFO
org.apache.flink.runtime.instance.InstanceManager             -
Registering TaskManager at
fps-flink-taskmanager-701246518-rb4k6/10.225.131.3 which was marked as
dead earlier because of a heart-beat timeout.
2017-11-16 19:17:13,685 INFO
org.apache.flink.runtime.instance.InstanceManager             -
Registered TaskManager at fps-flink-taskmanager-701246518-rb4k6
(akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416/user/taskmanager)
as 8905a72ca2655b9ed0bddb50ff3c49a9. Current number of registered
hosts is 4. Current number of alive task slots is 4.
2017-11-16 19:17:24,480 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     -
Triggering checkpoint 650 @ 1510859844480


===== Logs from TM ======

2017-11-16 19:13:33,177 INFO
org.apache.flink.runtime.state.DefaultOperatorStateBackend    -
DefaultOperatorStateBackend snapshot (File Stream Factory @
s3a://zendesk-usw2-fraud-prevention-production/checkpoints/3de8f35a1af689237e3c5c94023aba3f,
synchronous part) in thread Thread[Async calls on Source:
KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1),5,Flink Task
Threads] took 1 ms.
2017-11-16 19:14:35,820 WARN
akka.remote.RemoteWatcher                                     -
Detected unreachable: [akka.tcp://flink@fps-flink-jobmanager:6123]
2017-11-16 19:14:37,525 INFO
org.apache.flink.runtime.taskmanager.TaskManager              -
TaskManager akka://flink/user/taskmanager disconnects from JobManager
akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager: JobManager
is no longer reachable
2017-11-16 19:15:50,827 INFO
org.apache.flink.runtime.taskmanager.TaskManager              -
Cancelling all computations and discarding all cached data.
2017-11-16 19:16:23,596 INFO  com.amazonaws.http.AmazonHttpClient
                     - Unable to execute HTTP request: The target
server failed to respond
org.apache.http.NoHttpResponseException: The target server failed to respond
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:95)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
	at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
	at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
	at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
	at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
	at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
	at com.amazonaws.http.protocol.SdkHttpRequestExecutor.doReceiveResponse(SdkHttpRequestExecutor.java:66)
	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
	at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)
	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
	at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
	at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:1144)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:1133)
	at org.apache.hadoop.fs.s3a.S3AOutputStream.close(S3AOutputStream.java:142)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
	at org.apache.flink.runtime.fs.hdfs.HadoopDataOutputStream.close(HadoopDataOutputStream.java:48)
	at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)
	at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:319)
	at org.apache.flink.runtime.checkpoint.AbstractAsyncSnapshotIOCallable.closeStreamAndGetStateHandle(AbstractAsyncSnapshotIOCallable.java:100)
	at org.apache.flink.runtime.state.DefaultOperatorStateBackend$1.performOperation(DefaultOperatorStateBackend.java:270)
	at org.apache.flink.runtime.state.DefaultOperatorStateBackend$1.performOperation(DefaultOperatorStateBackend.java:233)
	at org.apache.flink.runtime.io.async.AbstractAsyncIOCallable.call(AbstractAsyncIOCallable.java:72)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:906)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
2017-11-16 19:16:35,816 INFO
org.apache.flink.runtime.taskmanager.Task                     -
Attempting to fail task externally Source:
KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(484ebabbd13dce5e8503d88005bcdb6c).
2017-11-16 19:16:35,819 WARN  Remoting
                     - Tried to associate with unreachable remote
address [akka.tcp://flink@fps-flink-jobmanager:6123]. Address is now
gated for 5000 ms, all messages to this address will be delivered to
dead letters. Reason: [The remote system has a UID that has been
quarantined. Association aborted.]
2017-11-16 19:16:45,817 INFO
org.apache.flink.runtime.taskmanager.Task                     -
Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(484ebabbd13dce5e8503d88005bcdb6c) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager akka://flink/user/taskmanager
disconnects from JobManager
akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager: JobManager
is no longer reachable
	at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1095)
	at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:311)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
	at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
	at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
	at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:120)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
	at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
	at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
	at akka.actor.ActorCell.invoke(ActorCell.scala:486)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
	at akka.dispatch.Mailbox.run(Mailbox.scala:220)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-11-16 19:16:55,815 INFO
org.apache.flink.runtime.state.DefaultOperatorStateBackend    -
DefaultOperatorStateBackend snapshot (File Stream Factory @
s3a://zendesk-usw2-fraud-prevention-production/checkpoints/3de8f35a1af689237e3c5c94023aba3f,
asynchronous part) in thread Thread[pool-7-thread-647,5,Flink Task
Threads] took 202637 ms.
2017-11-16 19:16:55,816 INFO
org.apache.flink.runtime.taskmanager.Task                     -
Triggering cancellation of task code Source:
KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(484ebabbd13dce5e8503d88005bcdb6c).
2017-11-16 19:17:00,825 INFO
org.apache.flink.runtime.taskmanager.TaskManager              -
Disassociating from JobManager
2017-11-16 19:17:00,825 INFO  org.apache.flink.runtime.blob.BlobCache
                     - Shutting down BlobCache
2017-11-16 19:17:13,677 INFO
org.apache.flink.runtime.taskmanager.TaskManager              - Trying
to register at JobManager
akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager (attempt 1,
timeout: 500 milliseconds)
2017-11-16 19:17:13,687 INFO
org.apache.flink.runtime.taskmanager.TaskManager              -
Successful registration at JobManager
(akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager), starting
network stack and library cache.
2017-11-16 19:17:13,687 INFO
org.apache.flink.runtime.taskmanager.TaskManager              -
Determined BLOB server address to be
fps-flink-jobmanager/10.231.37.240:6124.
Starting BLOB cache.
2017-11-16 19:17:13,687 INFO  org.apache.flink.runtime.blob.BlobCache
                     - Created BLOB cache storage directory
/tmp/blobStore-133033d7-a3c1-4ffb-b360-3d0efdc355aa
2017-11-16 19:17:31,097 WARN
org.apache.flink.runtime.taskmanager.Task                     - Task
'Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)' did not react to
cancelling signal, but is stuck in method:
 java.util.regex.Pattern.matcher(Pattern.java:1093)
java.lang.String.replace(String.java:2239)
com.trueaccord.scalapb.textformat.TextFormatUtils$.escapeDoubleQuotesAndBackslashes(TextFormatUtils.scala:160)
com.trueaccord.scalapb.textformat.TextGenerator.addMaybeEscape(TextGenerator.scala:29)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:74)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:41)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
scala.collection.Iterator$class.foreach(Iterator.scala:742)
scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
scala.collection.AbstractIterable.foreach(Iterable.scala:54)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:25)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:8)
com.trueaccord.scalapb.textformat.Printer$.printToString(Printer.scala:19)
com.trueaccord.scalapb.TextFormat$.printToUnicodeString(TextFormat.scala:34)
com.zendesk.fraud_prevention.data_types.MaxwellEvent.toString(MaxwellEvent.scala:212)
java.lang.String.valueOf(String.java:2994)
scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:362)
scala.collection.immutable.List.foreach(List.scala:381)
scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:355)
scala.collection.AbstractTraversable.addString(Traversable.scala:104)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:321)
scala.collection.AbstractTraversable.mkString(Traversable.scala:104)
scala.collection.TraversableLike$class.toString(TraversableLike.scala:645)
scala.collection.SeqLike$class.toString(SeqLike.scala:682)
scala.collection.AbstractSeq.toString(Seq.scala:41)
java.lang.String.valueOf(String.java:2994)
java.lang.StringBuilder.append(StringBuilder.java:131)
scala.StringContext.standardInterpolator(StringContext.scala:125)
scala.StringContext.s(StringContext.scala:95)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:12)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:10)
org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:304)
org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:393)
org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:233)
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.emitRecord(Kafka09Fetcher.java:193)
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:152)
org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:483)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
java.lang.Thread.run(Thread.java:748)

2017-11-16 19:17:31,106 WARN
org.apache.flink.runtime.taskmanager.Task                     - Task
'Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)' did not react to
cancelling signal, but is stuck in method:
 com.trueaccord.scalapb.textformat.TextGenerator.add(TextGenerator.scala:19)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:33)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
scala.collection.Iterator$class.foreach(Iterator.scala:742)
scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
scala.collection.AbstractIterable.foreach(Iterable.scala:54)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:25)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:8)
com.trueaccord.scalapb.textformat.Printer$.printToString(Printer.scala:19)
com.trueaccord.scalapb.TextFormat$.printToUnicodeString(TextFormat.scala:34)
com.zendesk.fraud_prevention.data_types.MaxwellEvent.toString(MaxwellEvent.scala:212)
java.lang.String.valueOf(String.java:2994)
scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:362)
scala.collection.immutable.List.foreach(List.scala:381)
scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:355)
scala.collection.AbstractTraversable.addString(Traversable.scala:104)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:321)
scala.collection.AbstractTraversable.mkString(Traversable.scala:104)
scala.collection.TraversableLike$class.toString(TraversableLike.scala:645)
scala.collection.SeqLike$class.toString(SeqLike.scala:682)
scala.collection.AbstractSeq.toString(Seq.scala:41)
java.lang.String.valueOf(String.java:2994)
java.lang.StringBuilder.append(StringBuilder.java:131)
scala.StringContext.standardInterpolator(StringContext.scala:125)
scala.StringContext.s(StringContext.scala:95)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:12)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:10)
org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:304)
org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:393)
org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:233)
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.emitRecord(Kafka09Fetcher.java:193)
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:152)
org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:483)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
java.lang.Thread.run(Thread.java:748)

2017-11-16 19:18:05,486 WARN
org.apache.flink.runtime.taskmanager.Task                     - Task
'Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)' did not react to
cancelling signal, but is stuck in method:
 java.lang.String.replace(String.java:2240)
com.trueaccord.scalapb.textformat.TextGenerator.addMaybeEscape(TextGenerator.scala:29)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:74)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:41)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
scala.collection.Iterator$class.foreach(Iterator.scala:742)
scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
scala.collection.AbstractIterable.foreach(Iterable.scala:54)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:25)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:8)
com.trueaccord.scalapb.textformat.Printer$.printToString(Printer.scala:19)
com.trueaccord.scalapb.TextFormat$.printToUnicodeString(TextFormat.scala:34)
com.zendesk.fraud_prevention.data_types.MaxwellEvent.toString(MaxwellEvent.scala:212)
java.lang.String.valueOf(String.java:2994)
java.lang.StringBuilder.append(StringBuilder.java:131)
scala.StringContext.standardInterpolator(StringContext.scala:125)
scala.StringContext.s(StringContext.scala:95)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper$$anonfun$flatMap$1.apply(MaxwellFilterFlatMapper.scala:14)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper$$anonfun$flatMap$1.apply(MaxwellFilterFlatMapper.scala:13)
scala.collection.immutable.List.foreach(List.scala:381)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:13)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:10)
org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:304)
org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:393)
org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:233)
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.emitRecord(Kafka09Fetcher.java:193)
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:152)
org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:483)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
java.lang.Thread.run(Thread.java:748)

2017-11-16 19:18:34,375 INFO
org.apache.flink.runtime.taskmanager.Task                     -
Freeing task resources for Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(484ebabbd13dce5e8503d88005bcdb6c).
2017-11-16 19:18:36,257 INFO
org.apache.flink.runtime.taskmanager.Task                     -
Ensuring all FileSystem streams are closed for task Source:
KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(484ebabbd13dce5e8503d88005bcdb6c) [FAILED]
2017-11-16 19:18:36,618 ERROR
org.apache.flink.runtime.taskmanager.TaskManager              - Cannot
find task with ID 484ebabbd13dce5e8503d88005bcdb6c to unregister.

Re: Task manager suddenly lost connection to JM

Posted by Hao Sun <ha...@zendesk.com>.
Thanks! Let me try.
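For anyone hitting the same symptom, one quick way to check the theory Stephan describes below is to print which jar each Netty class family is loaded from on the TaskManager classpath. This is a diagnostic sketch, not from the thread: the helper object name is made up, and it only assumes the standard package names of Netty 3 (org.jboss.netty, used by Akka 2.x remoting) and Netty 4 (io.netty, often pulled in via Hadoop or ZooKeeper).

```scala
// Hypothetical diagnostic helper: report the code-source (jar) each
// Netty class family resolves to, to spot a conflicting second copy.
object NettyOrigin {
  /** Location of the jar that defines `className`, or a marker string. */
  def originOf(className: String): String =
    try {
      val cls = Class.forName(className)
      Option(cls.getProtectionDomain.getCodeSource)
        .map(_.getLocation.toString)
        .getOrElse("(bootstrap / unknown)")
    } catch {
      case _: ClassNotFoundException => "(not on classpath)"
    }

  def main(args: Array[String]): Unit =
    Seq(
      "org.jboss.netty.channel.Channel", // Netty 3, used by Akka 2.x remoting
      "io.netty.channel.Channel"         // Netty 4, commonly a transitive dep
    ).foreach(c => println(s"$c -> ${originOf(c)}"))
}
```

If a Netty class resolves to an unexpected jar (e.g. a Hadoop or ZooKeeper uber-jar), that points at the conflict, and the workaround on 1.3.2 is the classpath exclusion Stephan suggests (e.g. via sbt `exclude(...)` or Maven `<exclusions>` on the offending dependency).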

On Mon, Nov 20, 2017, 12:46 Stephan Ewen <se...@apache.org> wrote:

> We recently observed something like that in some tests and the problem was
> the following:
>
> A Netty dependency pulled in via Hadoop or ZooKeeper conflicted with
> Akka's Netty dependency, which led to remote connection failures.
>
> In Flink 1.4, we fix that by shading Akka's Netty to make sure this cannot
> happen.
>
> If that is in fact the problem you see, in Flink 1.3.2, you need to
> exclude Hadoop's Netty / ZooKeeper's Netty from the classpath.
>
> Best,
> Stephan
>
>
> On Mon, Nov 20, 2017 at 6:02 AM, Chan, Regina <Re...@gs.com> wrote:
>
>> I have a similar problem where I lose Task Managers. I originally thought
>> it had to do with memory issues, but it doesn't look like that's the case.
>> Any ideas here? Am I missing something obvious?
>>
>>
>>
>> 11/19/2017 23:43:51     CHAIN DataSource (at
>> createInput(ExecutionEnvironment.java:553)
>> (org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> Filter
>> (Filter at readParquetFile(FlinkRefinerOperations.java:899)) -> Map (Map at
>> readParquetFile(FlinkRefinerOperations.java:902)) -> Map (Map at
>> handleMilestoning(MergeTask.java:259))(1/1) switched to FAILED
>>
>> java.lang.Exception: TaskManager was lost/killed:
>> ResourceID{resourceId='container_e5420_1511132977419_37521_01_000027'} @
>> d173636-336.dc.gs.com
>> (dataPort=45003)
>>
>>         at
>> org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>>
>>         at
>> org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
>>
>>         at
>> org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>>
>>         at
>> org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>>
>>         at
>> org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Regina
>>
>>
>>
>> *From:* Hao Sun [mailto:hasun@zendesk.com]
>> *Sent:* Thursday, November 16, 2017 5:37 PM
>> *To:* user
>> *Subject:* Task manager suddenly lost connection to JM
>>
>>
>>
>> Hi team, I see a weird issue where one of my TMs suddenly lost connection
>> to the JM.
>>
>> Once the job running on the TM is relocated to a new TM, it can reconnect
>> to the JM again.
>>
>> And after a while, the new TM running the same job will repeat the same
>> process.
>>
>> It is not guaranteed that the troubled TMs can reconnect to the JM in a
>> reasonable time frame, like minutes. Sometimes it takes days to reconnect
>> successfully.
>>
>>
>>
>> I am using Flink 1.3.2 and Kubernetes. Is this because of network
>> congestion?
>>
>>
>>
>> Thanks!
>>
>>
>>
>> ===== Logs from JM ======
>>
>> *2017-11-16 19:14:40,216 WARN  akka.remote.RemoteWatcher*                                     - Detected unreachable: [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]
>>
>> 2017-11-16 19:14:40,218 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Task manager akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416/user/taskmanager terminated.
>>
>> 2017-11-16 19:14:40,219 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c) switched from RUNNING to FAILED.
>>
>> java.lang.Exception: TaskManager was lost/killed: 50cae001c1d97e55889a6051319f4746 @ fps-flink-taskmanager-701246518-rb4k6 (dataPort=43871)
>>
>>   at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>>
>>   at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
>>
>>   at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>>
>>   at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>>
>>   at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>>
>>   at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)
>>
>>   at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1131)
>>
>>   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>
>>   at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>>
>>   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>
>>   at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>
>>   at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>
>>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>
>>   at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>
>>   at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>>
>>   at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)
>>
>>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>
>>   at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
>>
>>   at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>>
>>   at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>>
>>   at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>>
>>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>>
>>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>
>>   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>>
>>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>
>>   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>
>>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>
>>   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>> *2017-11-16 19:14:40,219 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched from state RUNNING to FAILING.*
>>
>> *java.lang.Exception: TaskManager was lost/killed: 50cae001c1d97e55889a6051319f4746 @ fps-flink-taskmanager-701246518-rb4k6 (dataPort=43871)*
>>
>>   at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>>
>>   at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
>>
>>   at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>>
>>   at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>>
>>   at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>>
>>   at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)
>>
>>   at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1131)
>>
>>   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>
>>   at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>>
>>   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>
>>   at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>
>>   at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>
>>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>
>>   at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>
>>   at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>>
>>   at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)
>>
>>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>
>>   at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
>>
>>   at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>>
>>   at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>>
>>   at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>>
>>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>>
>>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>
>>   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>>
>>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>
>>   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>
>>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>
>>   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>> 2017-11-16 19:14:40,227 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Try to restart or fail the job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) if no longer possible.
>>
>> 2017-11-16 19:14:40,227 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched from state FAILING to RESTARTING.
>>
>> 2017-11-16 19:14:40,227 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Restarting the job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f).
>>
>> 2017-11-16 19:14:40,229 INFO  org.apache.flink.runtime.instance.InstanceManager             - Unregistered task manager fps-flink-taskmanager-701246518-rb4k6/10.225.131.3. Number of registered task managers 3. Number of available slots 3.
>>
>> 2017-11-16 19:14:40,301 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]] Caused by: [fps-flink-taskmanager-701246518-rb4k6: Name does not resolve]
>>
>> 2017-11-16 19:14:50,228 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched from state RESTARTING to CREATED.
>>
>> 2017-11-16 19:14:50,228 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Restoring from latest valid checkpoint: Checkpoint 646 @ 1510859465870 for 3de8f35a1af689237e3c5c94023aba3f.
>>
>> 2017-11-16 19:14:50,233 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - No master state to restore
>>
>> 2017-11-16 19:14:50,233 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched from state CREATED to RUNNING.
>>
>> 2017-11-16 19:14:50,233 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (6ae48dd06a2a48dce68a3ea83c21a462) switched from CREATED to SCHEDULED.
>>
>> 2017-11-16 19:14:50,234 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (6ae48dd06a2a48dce68a3ea83c21a462) switched from SCHEDULED to DEPLOYING.
>>
>> 2017-11-16 19:14:50,234 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Deploying Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (attempt #1) to fps-flink-taskmanager-701246518-v9vjx
>>
>> 2017-11-16 19:14:50,244 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]] Caused by: [fps-flink-taskmanager-701246518-rb4k6]
>>
>> 2017-11-16 19:14:50,987 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (6ae48dd06a2a48dce68a3ea83c21a462) switched from DEPLOYING to RUNNING.
>>
>> 2017-11-16 19:15:24,480 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 649 @ 1510859724479
>>
>> 2017-11-16 19:15:25,638 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 649 (6724 bytes in 928 ms).
>>
>> 2017-11-16 19:16:22,205 WARN  Remoting                                                      - Tried to associate with unreachable remote address [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
>>
>> 2017-11-16 19:16:50,233 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 648 @ 1510859810233
>>
>> 2017-11-16 19:17:12,972 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Received message for non-existing checkpoint 647
>>
>> 2017-11-16 19:17:13,685 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registering TaskManager at fps-flink-taskmanager-701246518-rb4k6/10.225.131.3 which was marked as dead earlier because of a heart-beat timeout.
>>
>> 2017-11-16 19:17:13,685 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at fps-flink-taskmanager-701246518-rb4k6 (akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416/user/taskmanager) as 8905a72ca2655b9ed0bddb50ff3c49a9. Current number of registered hosts is 4. Current number of alive task slots is 4.
>>
>> 2017-11-16 19:17:24,480 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 650 @ 1510859844480
>>
>>
>>
>> ===== Logs from TM ======
>>
>> 2017-11-16 19:13:33,177 INFO  org.apache.flink.runtime.state.DefaultOperatorStateBackend    - DefaultOperatorStateBackend snapshot (File Stream Factory @ s3a://zendesk-usw2-fraud-prevention-production/checkpoints/3de8f35a1af689237e3c5c94023aba3f, synchronous part) in thread Thread[Async calls on Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1),5,Flink Task Threads] took 1 ms.
>>
>> *2017-11-16 19:14:35,820 WARN  akka.remote.RemoteWatcher                                     - Detected unreachable: [akka.tcp://flink@fps-flink-jobmanager:6123]*
>>
>> 2017-11-16 19:14:37,525 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager: JobManager is no longer reachable
>>
>> 2017-11-16 19:15:50,827 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Cancelling all computations and discarding all cached data.
>>
>> 2017-11-16 19:16:23,596 INFO  com.amazonaws.http.AmazonHttpClient                           - Unable to execute HTTP request: The target server failed to respond
>>
>> org.apache.http.NoHttpResponseException: The target server failed to respond
>>
>>   at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:95)
>>
>>   at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
>>
>>   at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
>>
>>   at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
>>
>>   at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
>>
>>   at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
>>
>>   at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
>>
>>   at com.amazonaws.http.protocol.SdkHttpRequestExecutor.doReceiveResponse(SdkHttpRequestExecutor.java:66)
>>
>>   at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
>>
>>   at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)
>>
>>   at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)
>>
>>   at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
>>
>>   at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
>>
>>   at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
>>
>>   at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
>>
>>   at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
>>
>>   at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
>>
>>   at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
>>
>>   at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
>>
>>   at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:1144)
>>
>>   at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:1133)
>>
>>   at org.apache.hadoop.fs.s3a.S3AOutputStream.close(S3AOutputStream.java:142)
>>
>>   at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>
>>   at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
>>
>>   at org.apache.flink.runtime.fs.hdfs.HadoopDataOutputStream.close(HadoopDataOutputStream.java:48)
>>
>>   at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)
>>
>>   at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:319)
>>
>>   at org.apache.flink.runtime.checkpoint.AbstractAsyncSnapshotIOCallable.closeStreamAndGetStateHandle(AbstractAsyncSnapshotIOCallable.java:100)
>>
>>   at org.apache.flink.runtime.state.DefaultOperatorStateBackend$1.performOperation(DefaultOperatorStateBackend.java:270)
>>
>>   at org.apache.flink.runtime.state.DefaultOperatorStateBackend$1.performOperation(DefaultOperatorStateBackend.java:233)
>>
>>   at org.apache.flink.runtime.io.async.AbstractAsyncIOCallable.call(AbstractAsyncIOCallable.java:72)
>>
>>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>
>>   at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40)
>>
>>   at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:906)
>>
>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>
>>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>
>>   at java.lang.Thread.run(Thread.java:748)
>>
>> 2017-11-16 19:16:35,816 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to fail task externally Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c).
>>
>> 2017-11-16 19:16:35,819 WARN  Remoting                                                      - Tried to associate with unreachable remote address [akka.tcp://flink@fps-flink-jobmanager:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has a UID that has been quarantined. Association aborted.]
>>
>> *2017-11-16 19:16:45,817 INFO  org.apache.flink.runtime.taskmanager.Task                     - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c) switched from RUNNING to FAILED.*
>>
>> *java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager: JobManager is no longer reachable*
>>
>>   at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1095)
>>
>>   at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:311)
>>
>>   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>
>>   at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>>
>>   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>
>>   at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>
>>   at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>
>>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>
>>   at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>
>>   at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>>
>>   at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:120)
>>
>>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>
>>   at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
>>
>>   at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>>
>>   at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>>
>>   at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>>
>>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>>
>>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>
>>   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>>
>>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>
>>   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>
>>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>
>>   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>> 2017-11-16 19:16:55,815 INFO  org.apache.flink.runtime.state.DefaultOperatorStateBackend    - DefaultOperatorStateBackend snapshot (File Stream Factory @ s3a://zendesk-usw2-fraud-prevention-production/checkpoints/3de8f35a1af689237e3c5c94023aba3f, asynchronous part) in thread Thread[pool-7-thread-647,5,Flink Task Threads] took 202637 ms.
>>
>> 2017-11-16 19:16:55,816 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c).
>>
>> 2017-11-16 19:17:00,825 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Disassociating from JobManager
>>
>> 2017-11-16 19:17:00,825 INFO  org.apache.flink.runtime.blob.BlobCache                       - Shutting down BlobCache
>>
>> 2017-11-16 19:17:13,677 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
>>
>> *2017-11-16 19:17:13,687 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Successful registration at JobManager (akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager), starting network stack and library cache.*
>>
>> *2017-11-16 19:17:13,687 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Determined BLOB server address to be fps-flink-jobmanager/10.231.37.240:6124. Starting BLOB cache.*
>>
>> *2017-11-16 19:17:13,687 INFO  org.apache.flink.runtime.blob.BlobCache                       - Created BLOB cache storage directory /tmp/blobStore-133033d7-a3c1-4ffb-b360-3d0efdc355aa*
>>
>> 2017-11-16 19:17:31,097 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)' did not react to cancelling signal, but is stuck in method:
>>
>>  java.util.regex.Pattern.matcher(Pattern.java:1093)
>>
>> java.lang.String.replace(String.java:2239)
>>
>> com.trueaccord.scalapb.textformat.TextFormatUtils$.escapeDoubleQuotesAndBackslashes(TextFormatUtils.scala:160)
>>
>> com.trueaccord.scalapb.textformat.TextGenerator.addMaybeEscape(TextGenerator.scala:29)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:74)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:41)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
>>
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>
>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>
>> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
>>
>> scala.collection.Iterator$class.foreach(Iterator.scala:742)
>>
>> scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>>
>> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>>
>> scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:25)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
>>
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>
>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>
>> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
>>
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>
>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>
>> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
>>
>> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:8)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printToString(Printer.scala:19)
>>
>> com.trueaccord.scalapb.TextFormat$.printToUnicodeString(TextFormat.scala:34)
>>
>> com.zendesk.fraud_prevention.data_types.MaxwellEvent.toString(MaxwellEvent.scala:212)
>>
>> java.lang.String.valueOf(String.java:2994)
>>
>> scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
>>
>> scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:362)
>>
>> scala.collection.immutable.List.foreach(List.scala:381)
>>
>> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:355)
>>
>> scala.collection.AbstractTraversable.addString(Traversable.scala:104)
>>
>> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:321)
>>
>> scala.collection.AbstractTraversable.mkString(Traversable.scala:104)
>>
>> scala.collection.TraversableLike$class.toString(TraversableLike.scala:645)
>>
>> scala.collection.SeqLike$class.toString(SeqLike.scala:682)
>>
>> scala.collection.AbstractSeq.toString(Seq.scala:41)
>>
>> java.lang.String.valueOf(String.java:2994)
>>
>> java.lang.StringBuilder.append(StringBuilder.java:131)
>>
>> scala.StringContext.standardInterpolator(StringContext.scala:125)
>>
>> scala.StringContext.s(StringContext.scala:95)
>>
>> com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:12)
>>
>> com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:10)
>>
>> org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
>>
>> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
>>
>> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
>>
>> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
>>
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
>>
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
>>
>> org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:304)
>>
>> org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:393)
>>
>> org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:233)
>>
>> org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.emitRecord(Kafka09Fetcher.java:193)
>>
>> org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:152)
>>
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:483)
>>
>> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
>>
>> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
>>
>> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
>>
>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
>>
>> org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
>>
>> java.lang.Thread.run(Thread.java:748)
>>
>>
>>
>> 2017-11-16 19:17:31,106 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)' did not react to cancelling signal, but is stuck in method:
>>
>>  com.trueaccord.scalapb.textformat.TextGenerator.add(TextGenerator.scala:19)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:33)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
>>
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>
>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>
>> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
>>
>> scala.collection.Iterator$class.foreach(Iterator.scala:742)
>>
>> scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>>
>> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>>
>> scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:25)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
>>
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>
>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>
>> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
>>
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>
>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>
>> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
>>
>> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:8)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printToString(Printer.scala:19)
>>
>> com.trueaccord.scalapb.TextFormat$.printToUnicodeString(TextFormat.scala:34)
>>
>> com.zendesk.fraud_prevention.data_types.MaxwellEvent.toString(MaxwellEvent.scala:212)
>>
>> java.lang.String.valueOf(String.java:2994)
>>
>> scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
>>
>> scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:362)
>>
>> scala.collection.immutable.List.foreach(List.scala:381)
>>
>> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:355)
>>
>> scala.collection.AbstractTraversable.addString(Traversable.scala:104)
>>
>> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:321)
>>
>> scala.collection.AbstractTraversable.mkString(Traversable.scala:104)
>>
>> scala.collection.TraversableLike$class.toString(TraversableLike.scala:645)
>>
>> scala.collection.SeqLike$class.toString(SeqLike.scala:682)
>>
>> scala.collection.AbstractSeq.toString(Seq.scala:41)
>>
>> java.lang.String.valueOf(String.java:2994)
>>
>> java.lang.StringBuilder.append(StringBuilder.java:131)
>>
>> scala.StringContext.standardInterpolator(StringContext.scala:125)
>>
>> scala.StringContext.s(StringContext.scala:95)
>>
>> com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:12)
>>
>> com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:10)
>>
>> org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
>>
>> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
>>
>> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
>>
>> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
>>
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
>>
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
>>
>> org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:304)
>>
>> org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:393)
>>
>> org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:233)
>>
>> org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.emitRecord(Kafka09Fetcher.java:193)
>>
>> org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:152)
>>
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:483)
>>
>> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
>>
>> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
>>
>> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
>>
>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
>>
>> org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
>>
>> java.lang.Thread.run(Thread.java:748)
>>
>>
>>
>> 2017-11-16 19:18:05,486 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)' did not react to cancelling signal, but is stuck in method:
>>
>>  java.lang.String.replace(String.java:2240)
>>
>> com.trueaccord.scalapb.textformat.TextGenerator.addMaybeEscape(TextGenerator.scala:29)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:74)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:41)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
>>
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>
>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>
>> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
>>
>> scala.collection.Iterator$class.foreach(Iterator.scala:742)
>>
>> scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>>
>> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>>
>> scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:25)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
>>
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>
>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>
>> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
>>
>> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
>>
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>
>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>
>> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
>>
>> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:8)
>>
>> com.trueaccord.scalapb.textformat.Printer$.printToString(Printer.scala:19)
>>
>> com.trueaccord.scalapb.TextFormat$.printToUnicodeString(TextFormat.scala:34)
>>
>> com.zendesk.fraud_prevention.data_types.MaxwellEvent.toString(MaxwellEvent.scala:212)
>>
>> java.lang.String.valueOf(String.java:2994)
>>
>> java.lang.StringBuilder.append(StringBuilder.java:131)
>>
>> scala.StringContext.standardInterpolator(StringContext.scala:125)
>>
>> scala.StringContext.s(StringContext.scala:95)
>>
>> com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper$$anonfun$flatMap$1.apply(MaxwellFilterFlatMapper.scala:14)
>>
>> com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper$$anonfun$flatMap$1.apply(MaxwellFilterFlatMapper.scala:13)
>>
>> scala.collection.immutable.List.foreach(List.scala:381)
>>
>> com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:13)
>>
>> com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:10)
>>
>> org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
>>
>> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
>>
>> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
>>
>> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
>>
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
>>
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
>>
>> org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:304)
>>
>> org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:393)
>>
>> org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:233)
>>
>> org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.emitRecord(Kafka09Fetcher.java:193)
>>
>> org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:152)
>>
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:483)
>>
>> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
>>
>> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
>>
>> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
>>
>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
>>
>> org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
>>
>> java.lang.Thread.run(Thread.java:748)
>>
>>
>>
>> 2017-11-16 19:18:34,375 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c).
>>
>> 2017-11-16 19:18:36,257 INFO  org.apache.flink.runtime.taskmanager.Task                     - Ensuring all FileSystem streams are closed for task Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c) [FAILED]
>>
>> 2017-11-16 19:18:36,618 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Cannot find task with ID 484ebabbd13dce5e8503d88005bcdb6c to unregister.
>>
>>
>

Re: Task manager suddenly lost connection to JM

Posted by Stephan Ewen <se...@apache.org>.
We recently observed something like this in some tests, and the problem was
the following:

A Netty dependency pulled in via Hadoop or ZooKeeper conflicted with Akka's
Netty dependency, which led to remote connection failures.

In Flink 1.4, we fix that by shading Akka's Netty to make sure this cannot
happen.

If that is in fact the problem you are seeing, then in Flink 1.3.2 you need
to exclude Hadoop's / ZooKeeper's Netty from the classpath.
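For an sbt-based build, such an exclusion might look roughly like the following. This is only a sketch: the dependency versions and the exact Netty coordinates (`io.netty:netty` vs. `io.netty:netty-all`) are assumptions that should be verified against the actual output of `sbt dependencyTree` or `mvn dependency:tree`.

```scala
// build.sbt -- keep the Netty pulled in transitively by Hadoop/ZooKeeper
// off the classpath, so it cannot clash with the Netty version Akka expects.
// Versions and Netty coordinates below are illustrative; check your own
// dependency tree before copying them.
libraryDependencies ++= Seq(
  "org.apache.flink" %% "flink-streaming-scala" % "1.3.2",
  ("org.apache.hadoop" % "hadoop-client" % "2.7.3")
    .exclude("io.netty", "netty")        // Netty 3.x artifact
    .exclude("io.netty", "netty-all"),   // Netty 4.x uber-jar
  ("org.apache.zookeeper" % "zookeeper" % "3.4.10")
    .exclude("io.netty", "netty")
)
```

Maven users can achieve the same effect with `<exclusions>` entries on the `hadoop-client` and `zookeeper` dependencies in the pom.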

Best,
Stephan


On Mon, Nov 20, 2017 at 6:02 AM, Chan, Regina <Re...@gs.com> wrote:

> I have a similar problem where I lose Task Managers. I originally thought
> it had to do with memory issues, but it doesn’t look like that’s the case. Any
> ideas here? Am I missing something obvious?
>
>
>
> 11/19/2017 23:43:51     CHAIN DataSource (at createInput(ExecutionEnvironment.java:553) (org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> Filter (Filter at readParquetFile(FlinkRefinerOperations.java:899)) -> Map (Map at readParquetFile(FlinkRefinerOperations.java:902)) -> Map (Map at handleMilestoning(MergeTask.java:259))(1/1) switched to FAILED
>
> java.lang.Exception: TaskManager was lost/killed: ResourceID{resourceId='container_e5420_1511132977419_37521_01_000027'} @ d173636-336.dc.gs.com (dataPort=45003)
>
>         at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>
>         at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
>
>         at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>
>         at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>
>         at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>
>
>
>
>
> Thanks,
>
> Regina
>
>
>
> *From:* Hao Sun [mailto:hasun@zendesk.com]
> *Sent:* Thursday, November 16, 2017 5:37 PM
> *To:* user
> *Subject:* Task manager suddenly lost connection to JM
>
>
>
> Hi team, I see a weird issue where one of my TMs suddenly lost connection
> to the JM.
>
> Once the job running on the TM is relocated to a new TM, it can reconnect to
> the JM again.
>
> And after a while, the new TM running the same job will repeat the same
> process.
>
> It is not guaranteed that the troubled TMs can reconnect to the JM in a
> reasonable time frame, like minutes. Sometimes it takes days to reconnect
> successfully.
>
>
>
> I am using Flink 1.3.2 and Kubernetes. Is this because of network
> congestion?
>
>
>
> Thanks!
>
>
>
> ===== Logs from JM ======
>
> *2017-11-16 19:14:40,216 WARN  akka.remote.RemoteWatcher*                                     - Detected unreachable: [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]
>
> 2017-11-16 19:14:40,218 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Task manager akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416/user/taskmanager terminated.
>
> 2017-11-16 19:14:40,219 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c) switched from RUNNING to FAILED.
>
> java.lang.Exception: TaskManager was lost/killed: 50cae001c1d97e55889a6051319f4746 @ fps-flink-taskmanager-701246518-rb4k6 (dataPort=43871)
>
>   at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>
>   at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
>
>   at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>
>   at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>
>   at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>
>   at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)
>
>   at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1131)
>
>   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>
>   at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>
>   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>
>   at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>
>   at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>
>   at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>
>   at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)
>
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>
>   at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
>
>   at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>
>   at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>
>   at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>
>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>
>   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>
>   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>
>   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> *2017-11-16 19:14:40,219 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched from state RUNNING to FAILING.*
>
> *java.lang.Exception: TaskManager was lost/killed: 50cae001c1d97e55889a6051319f4746 @ fps-flink-taskmanager-701246518-rb4k6 (dataPort=43871)*
>
>   at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>
>   at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
>
>   at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>
>   at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>
>   at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>
>   at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)
>
>   at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1131)
>
>   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>
>   at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>
>   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>
>   at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>
>   at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>
>   at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>
>   at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)
>
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>
>   at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
>
>   at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>
>   at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>
>   at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>
>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>
>   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>
>   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>
>   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> 2017-11-16 19:14:40,227 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Try to restart or fail the job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) if no longer possible.
> 2017-11-16 19:14:40,227 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched from state FAILING to RESTARTING.
> 2017-11-16 19:14:40,227 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Restarting the job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f).
> 2017-11-16 19:14:40,229 INFO  org.apache.flink.runtime.instance.InstanceManager             - Unregistered task manager fps-flink-taskmanager-701246518-rb4k6/10.225.131.3. Number of registered task managers 3. Number of available slots 3.
> 2017-11-16 19:14:40,301 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]] Caused by: [fps-flink-taskmanager-701246518-rb4k6: Name does not resolve]
> 2017-11-16 19:14:50,228 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched from state RESTARTING to CREATED.
> 2017-11-16 19:14:50,228 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Restoring from latest valid checkpoint: Checkpoint 646 @ 1510859465870 for 3de8f35a1af689237e3c5c94023aba3f.
> 2017-11-16 19:14:50,233 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - No master state to restore
> 2017-11-16 19:14:50,233 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched from state CREATED to RUNNING.
> 2017-11-16 19:14:50,233 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (6ae48dd06a2a48dce68a3ea83c21a462) switched from CREATED to SCHEDULED.
> 2017-11-16 19:14:50,234 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (6ae48dd06a2a48dce68a3ea83c21a462) switched from SCHEDULED to DEPLOYING.
> 2017-11-16 19:14:50,234 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Deploying Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (attempt #1) to fps-flink-taskmanager-701246518-v9vjx
> 2017-11-16 19:14:50,244 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]] Caused by: [fps-flink-taskmanager-701246518-rb4k6]
> 2017-11-16 19:14:50,987 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (6ae48dd06a2a48dce68a3ea83c21a462) switched from DEPLOYING to RUNNING.
> 2017-11-16 19:15:24,480 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 649 @ 1510859724479
> 2017-11-16 19:15:25,638 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 649 (6724 bytes in 928 ms).
> 2017-11-16 19:16:22,205 WARN  Remoting                                                      - Tried to associate with unreachable remote address [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
> 2017-11-16 19:16:50,233 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 648 @ 1510859810233
> 2017-11-16 19:17:12,972 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Received message for non-existing checkpoint 647
> 2017-11-16 19:17:13,685 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registering TaskManager at fps-flink-taskmanager-701246518-rb4k6/10.225.131.3 which was marked as dead earlier because of a heart-beat timeout.
> 2017-11-16 19:17:13,685 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at fps-flink-taskmanager-701246518-rb4k6 (akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416/user/taskmanager) as 8905a72ca2655b9ed0bddb50ff3c49a9. Current number of registered hosts is 4. Current number of alive task slots is 4.
> 2017-11-16 19:17:24,480 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 650 @ 1510859844480
>
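> [Note on the quarantine warnings above: once Akka logs "The remote system has quarantined this system", the quarantined TaskManager's actor system cannot re-associate until it restarts, which would explain the long reconnect delays. If transient Kubernetes DNS/network blips are tripping the death watch (see the "Name does not resolve" failure), one avenue is relaxing the Akka death-watch timeouts in flink-conf.yaml. The values below are purely illustrative, not a tested recommendation:
>
>     akka.watch.heartbeat.interval: 10 s
>     akka.watch.heartbeat.pause: 60 s
>     akka.ask.timeout: 60 s
>
> A longer heartbeat pause makes the JobManager slower to declare a TaskManager dead, trading failure-detection latency for tolerance of short outages.]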
> ===== Logs from TM ======
> 2017-11-16 19:13:33,177 INFO  org.apache.flink.runtime.state.DefaultOperatorStateBackend    - DefaultOperatorStateBackend snapshot (File Stream Factory @ s3a://zendesk-usw2-fraud-prevention-production/checkpoints/3de8f35a1af689237e3c5c94023aba3f, synchronous part) in thread Thread[Async calls on Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1),5,Flink Task Threads] took 1 ms.
> *2017-11-16 19:14:35,820 WARN  akka.remote.RemoteWatcher                                     - Detected unreachable: [akka.tcp://flink@fps-flink-jobmanager:6123]*
> 2017-11-16 19:14:37,525 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager: JobManager is no longer reachable
> 2017-11-16 19:15:50,827 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Cancelling all computations and discarding all cached data.
> 2017-11-16 19:16:23,596 INFO  com.amazonaws.http.AmazonHttpClient                           - Unable to execute HTTP request: The target server failed to respond
> org.apache.http.NoHttpResponseException: The target server failed to respond
>
>   at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:95)
>   at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
>   at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
>   at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
>   at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
>   at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
>   at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
>   at com.amazonaws.http.protocol.SdkHttpRequestExecutor.doReceiveResponse(SdkHttpRequestExecutor.java:66)
>   at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
>   at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)
>   at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)
>   at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
>   at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
>   at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
>   at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
>   at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
>   at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
>   at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:1144)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:1133)
>   at org.apache.hadoop.fs.s3a.S3AOutputStream.close(S3AOutputStream.java:142)
>   at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>   at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
>   at org.apache.flink.runtime.fs.hdfs.HadoopDataOutputStream.close(HadoopDataOutputStream.java:48)
>   at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)
>   at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:319)
>   at org.apache.flink.runtime.checkpoint.AbstractAsyncSnapshotIOCallable.closeStreamAndGetStateHandle(AbstractAsyncSnapshotIOCallable.java:100)
>   at org.apache.flink.runtime.state.DefaultOperatorStateBackend$1.performOperation(DefaultOperatorStateBackend.java:270)
>   at org.apache.flink.runtime.state.DefaultOperatorStateBackend$1.performOperation(DefaultOperatorStateBackend.java:233)
>   at org.apache.flink.runtime.io.async.AbstractAsyncIOCallable.call(AbstractAsyncIOCallable.java:72)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40)
>   at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:906)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
>
> 2017-11-16 19:16:35,816 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to fail task externally Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c).
> 2017-11-16 19:16:35,819 WARN  Remoting                                                      - Tried to associate with unreachable remote address [akka.tcp://flink@fps-flink-jobmanager:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has a UID that has been quarantined. Association aborted.]
> *2017-11-16 19:16:45,817 INFO  org.apache.flink.runtime.taskmanager.Task                     - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c) switched from RUNNING to FAILED.*
> *java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager: JobManager is no longer reachable*
>
>   at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1095)
>   at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:311)
>   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>   at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>   at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>   at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>   at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>   at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:120)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>   at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
>   at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>   at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> 2017-11-16 19:16:55,815 INFO  org.apache.flink.runtime.state.DefaultOperatorStateBackend    - DefaultOperatorStateBackend snapshot (File Stream Factory @ s3a://zendesk-usw2-fraud-prevention-production/checkpoints/3de8f35a1af689237e3c5c94023aba3f, asynchronous part) in thread Thread[pool-7-thread-647,5,Flink Task Threads] took 202637 ms.
> 2017-11-16 19:16:55,816 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c).
> 2017-11-16 19:17:00,825 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Disassociating from JobManager
> 2017-11-16 19:17:00,825 INFO  org.apache.flink.runtime.blob.BlobCache                       - Shutting down BlobCache
> 2017-11-16 19:17:13,677 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
> *2017-11-16 19:17:13,687 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Successful registration at JobManager (akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager), starting network stack and library cache.*
> *2017-11-16 19:17:13,687 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Determined BLOB server address to be fps-flink-jobmanager/10.231.37.240:6124. Starting BLOB cache.*
> *2017-11-16 19:17:13,687 INFO  org.apache.flink.runtime.blob.BlobCache                       - Created BLOB cache storage directory /tmp/blobStore-133033d7-a3c1-4ffb-b360-3d0efdc355aa*
> 2017-11-16 19:17:31,097 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)' did not react to cancelling signal, but is stuck in method:
>
>  java.util.regex.Pattern.matcher(Pattern.java:1093)
> java.lang.String.replace(String.java:2239)
> com.trueaccord.scalapb.textformat.TextFormatUtils$.escapeDoubleQuotesAndBackslashes(TextFormatUtils.scala:160)
> com.trueaccord.scalapb.textformat.TextGenerator.addMaybeEscape(TextGenerator.scala:29)
> com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:74)
> com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:41)
> com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
> com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
> com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
> scala.collection.Iterator$class.foreach(Iterator.scala:742)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:25)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
> com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
> com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
> com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:8)
> com.trueaccord.scalapb.textformat.Printer$.printToString(Printer.scala:19)
> com.trueaccord.scalapb.TextFormat$.printToUnicodeString(TextFormat.scala:34)
> com.zendesk.fraud_prevention.data_types.MaxwellEvent.toString(MaxwellEvent.scala:212)
> java.lang.String.valueOf(String.java:2994)
> scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
> scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:362)
> scala.collection.immutable.List.foreach(List.scala:381)
> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:355)
> scala.collection.AbstractTraversable.addString(Traversable.scala:104)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:321)
> scala.collection.AbstractTraversable.mkString(Traversable.scala:104)
> scala.collection.TraversableLike$class.toString(TraversableLike.scala:645)
> scala.collection.SeqLike$class.toString(SeqLike.scala:682)
> scala.collection.AbstractSeq.toString(Seq.scala:41)
> java.lang.String.valueOf(String.java:2994)
> java.lang.StringBuilder.append(StringBuilder.java:131)
> scala.StringContext.standardInterpolator(StringContext.scala:125)
> scala.StringContext.s(StringContext.scala:95)
> com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:12)
> com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:10)
> org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
> org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:304)
> org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:393)
> org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:233)
> org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.emitRecord(Kafka09Fetcher.java:193)
> org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:152)
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:483)
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
> java.lang.Thread.run(Thread.java:748)
>
> 2017-11-16 19:17:31,106 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)' did not react to cancelling signal, but is stuck in method:
>  com.trueaccord.scalapb.textformat.TextGenerator.add(TextGenerator.scala:19)
> com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:33)
> com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
> com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
> com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
> scala.collection.Iterator$class.foreach(Iterator.scala:742)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:25)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
> com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
> com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
> com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:8)
> com.trueaccord.scalapb.textformat.Printer$.printToString(Printer.scala:19)
> com.trueaccord.scalapb.TextFormat$.printToUnicodeString(TextFormat.scala:34)
> com.zendesk.fraud_prevention.data_types.MaxwellEvent.toString(MaxwellEvent.scala:212)
> java.lang.String.valueOf(String.java:2994)
> scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
> scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:362)
> scala.collection.immutable.List.foreach(List.scala:381)
> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:355)
> scala.collection.AbstractTraversable.addString(Traversable.scala:104)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:321)
> scala.collection.AbstractTraversable.mkString(Traversable.scala:104)
> scala.collection.TraversableLike$class.toString(TraversableLike.scala:645)
> scala.collection.SeqLike$class.toString(SeqLike.scala:682)
> scala.collection.AbstractSeq.toString(Seq.scala:41)
> java.lang.String.valueOf(String.java:2994)
> java.lang.StringBuilder.append(StringBuilder.java:131)
> scala.StringContext.standardInterpolator(StringContext.scala:125)
> scala.StringContext.s(StringContext.scala:95)
> com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:12)
> com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:10)
> org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
> org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:304)
> org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:393)
> org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:233)
> org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.emitRecord(Kafka09Fetcher.java:193)
> org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:152)
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:483)
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
> java.lang.Thread.run(Thread.java:748)
>
> 2017-11-16 19:18:05,486 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)' did not react to cancelling signal, but is stuck in method:
>  java.lang.String.replace(String.java:2240)
> com.trueaccord.scalapb.textformat.TextGenerator.addMaybeEscape(TextGenerator.scala:29)
> com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:74)
> com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:41)
> com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
> com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
> com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
> scala.collection.Iterator$class.foreach(Iterator.scala:742)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:25)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
> com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
> com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
> com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
> com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
> com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:8)
> com.trueaccord.scalapb.textformat.Printer$.printToString(Printer.scala:19)
> com.trueaccord.scalapb.TextFormat$.printToUnicodeString(TextFormat.scala:34)
> com.zendesk.fraud_prevention.data_types.MaxwellEvent.toString(MaxwellEvent.scala:212)
> java.lang.String.valueOf(String.java:2994)
> java.lang.StringBuilder.append(StringBuilder.java:131)
> scala.StringContext.standardInterpolator(StringContext.scala:125)
> scala.StringContext.s(StringContext.scala:95)
> com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper$$anonfun$flatMap$1.apply(MaxwellFilterFlatMapper.scala:14)
> com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper$$anonfun$flatMap$1.apply(MaxwellFilterFlatMapper.scala:13)
> scala.collection.immutable.List.foreach(List.scala:381)
> com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:13)
> com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:10)
> org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
> org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:304)
> org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:393)
> org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:233)
> org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.emitRecord(Kafka09Fetcher.java:193)
> org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:152)
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:483)
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
>
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
>
> java.lang.Thread.run(Thread.java:748)
>
>
>
> 2017-11-16 19:18:34,375 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c).
>
> 2017-11-16 19:18:36,257 INFO  org.apache.flink.runtime.taskmanager.Task                     - Ensuring all FileSystem streams are closed for task Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c) [FAILED]
>
> 2017-11-16 19:18:36,618 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Cannot find task with ID 484ebabbd13dce5e8503d88005bcdb6c to unregister.
>
>

RE: Task manager suddenly lost connection to JM

Posted by "Chan, Regina" <Re...@gs.com>.
I have a similar problem where I lose Task Managers. I originally thought it had to do with memory issues, but it doesn't look like that's the case… Any ideas here? Am I missing something obvious?
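
[Editor's note] In both reports the JM logs show Akka's death watch declaring the TM unreachable while the process itself is still alive (it later re-registers). When TMs merely pause (long GC, slow checkpoint I/O) rather than die, relaxing the death-watch timeouts in flink-conf.yaml can reduce false positives on Flink 1.3.x. A sketch with illustrative values only, not a tested recommendation:

```yaml
# flink-conf.yaml -- illustrative values; tune for your environment.

# Acceptable heartbeat pause before a remote TM is declared unreachable
# (Flink 1.3.x default: 60 s):
akka.watch.heartbeat.pause: 120 s

# Death-watch heartbeat interval (default: 10 s):
akka.watch.heartbeat.interval: 20 s

# Timeout for blocking JM/TM ask calls (default: 10 s):
akka.ask.timeout: 60 s
```

Note that once Akka has quarantined a remote system ("No further associations to the remote system are possible until this system is restarted", as in the logs below), the affected actor system must restart before it can reconnect; larger pauses only make quarantining less likely.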

11/19/2017 23:43:51     CHAIN DataSource (at createInput(ExecutionEnvironment.java:553) (org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> Filter (Filter at readParquetFile(FlinkRefinerOperations.java:899)) -> Map (Map at readParquetFile(FlinkRefinerOperations.java:902)) -> Map (Map at handleMilestoning(MergeTask.java:259))(1/1) switched to FAILED
java.lang.Exception: TaskManager was lost/killed: ResourceID{resourceId='container_e5420_1511132977419_37521_01_000027'} @ d173636-336.dc.gs.com (dataPort=45003)
        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)

Thanks,
Regina

From: Hao Sun [mailto:hasun@zendesk.com]
Sent: Thursday, November 16, 2017 5:37 PM
To: user
Subject: Task manager suddenly lost connection to JM

Hi team, I see a weird issue where one of my TMs suddenly lost its connection to the JM.
Once the job running on that TM is relocated to a new TM, it can reconnect to the JM again.
After a while, the new TM running the same job repeats the same process.
It is not guaranteed that the troubled TMs can reconnect to the JM in a reasonable time frame, like minutes. Sometimes it takes days to reconnect successfully.

I am using Flink 1.3.2 on Kubernetes. Could this be caused by network congestion?

Thanks!


===== Logs from JM ======

2017-11-16 19:14:40,216 WARN  akka.remote.RemoteWatcher                                     - Detected unreachable: [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]

2017-11-16 19:14:40,218 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Task manager akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416/user/taskmanager terminated.

2017-11-16 19:14:40,219 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c) switched from RUNNING to FAILED.

java.lang.Exception: TaskManager was lost/killed: 50cae001c1d97e55889a6051319f4746 @ fps-flink-taskmanager-701246518-rb4k6 (dataPort=43871)

  at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)

  at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)

  at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)

  at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)

  at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)

  at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)

  at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1131)

  at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

  at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)

  at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

  at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

  at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

  at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)

  at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

  at akka.actor.Actor$class.aroundReceive(Actor.scala:467)

  at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)

  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

  at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)

  at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)

  at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)

  at akka.actor.ActorCell.invoke(ActorCell.scala:486)

  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)

  at akka.dispatch.Mailbox.run(Mailbox.scala:220)

  at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)

  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2017-11-16 19:14:40,219 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched from state RUNNING to FAILING.

java.lang.Exception: TaskManager was lost/killed: 50cae001c1d97e55889a6051319f4746 @ fps-flink-taskmanager-701246518-rb4k6 (dataPort=43871)

  at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)

  at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)

  at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)

  at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)

  at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)

  at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)

  at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1131)

  at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

  at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)

  at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

  at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

  at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

  at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)

  at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

  at akka.actor.Actor$class.aroundReceive(Actor.scala:467)

  at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)

  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

  at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)

  at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)

  at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)

  at akka.actor.ActorCell.invoke(ActorCell.scala:486)

  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)

  at akka.dispatch.Mailbox.run(Mailbox.scala:220)

  at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)

  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2017-11-16 19:14:40,227 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Try to restart or fail the job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) if no longer possible.

2017-11-16 19:14:40,227 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched from state FAILING to RESTARTING.

2017-11-16 19:14:40,227 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Restarting the job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f).

2017-11-16 19:14:40,229 INFO  org.apache.flink.runtime.instance.InstanceManager             - Unregistered task manager fps-flink-taskmanager-701246518-rb4k6/10.225.131.3. Number of registered task managers 3. Number of available slots 3.

2017-11-16 19:14:40,301 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]] Caused by: [fps-flink-taskmanager-701246518-rb4k6: Name does not resolve]

2017-11-16 19:14:50,228 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched from state RESTARTING to CREATED.

2017-11-16 19:14:50,228 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Restoring from latest valid checkpoint: Checkpoint 646 @ 1510859465870 for 3de8f35a1af689237e3c5c94023aba3f.

2017-11-16 19:14:50,233 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - No master state to restore

2017-11-16 19:14:50,233 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched from state CREATED to RUNNING.

2017-11-16 19:14:50,233 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (6ae48dd06a2a48dce68a3ea83c21a462) switched from CREATED to SCHEDULED.

2017-11-16 19:14:50,234 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (6ae48dd06a2a48dce68a3ea83c21a462) switched from SCHEDULED to DEPLOYING.

2017-11-16 19:14:50,234 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Deploying Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (attempt #1) to fps-flink-taskmanager-701246518-v9vjx

2017-11-16 19:14:50,244 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]] Caused by: [fps-flink-taskmanager-701246518-rb4k6]

2017-11-16 19:14:50,987 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (6ae48dd06a2a48dce68a3ea83c21a462) switched from DEPLOYING to RUNNING.

2017-11-16 19:15:24,480 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 649 @ 1510859724479

2017-11-16 19:15:25,638 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 649 (6724 bytes in 928 ms).

2017-11-16 19:16:22,205 WARN  Remoting                                                      - Tried to associate with unreachable remote address [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]

2017-11-16 19:16:50,233 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 648 @ 1510859810233

2017-11-16 19:17:12,972 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Received message for non-existing checkpoint 647

2017-11-16 19:17:13,685 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registering TaskManager at fps-flink-taskmanager-701246518-rb4k6/10.225.131.3 which was marked as dead earlier because of a heart-beat timeout.

2017-11-16 19:17:13,685 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at fps-flink-taskmanager-701246518-rb4k6 (akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416/user/taskmanager) as 8905a72ca2655b9ed0bddb50ff3c49a9. Current number of registered hosts is 4. Current number of alive task slots is 4.

2017-11-16 19:17:24,480 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 650 @ 1510859844480



===== Logs from TM ======

2017-11-16 19:13:33,177 INFO  org.apache.flink.runtime.state.DefaultOperatorStateBackend    - DefaultOperatorStateBackend snapshot (File Stream Factory @ s3a://zendesk-usw2-fraud-prevention-production/checkpoints/3de8f35a1af689237e3c5c94023aba3f, synchronous part) in thread Thread[Async calls on Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1),5,Flink Task Threads] took 1 ms.

2017-11-16 19:14:35,820 WARN  akka.remote.RemoteWatcher                                     - Detected unreachable: [akka.tcp://flink@fps-flink-jobmanager:6123]

2017-11-16 19:14:37,525 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager: JobManager is no longer reachable

2017-11-16 19:15:50,827 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Cancelling all computations and discarding all cached data.

2017-11-16 19:16:23,596 INFO  com.amazonaws.http.AmazonHttpClient                           - Unable to execute HTTP request: The target server failed to respond

org.apache.http.NoHttpResponseException: The target server failed to respond

  at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:95)

  at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)

  at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)

  at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)

  at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)

  at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)

  at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)

  at com.amazonaws.http.protocol.SdkHttpRequestExecutor.doReceiveResponse(SdkHttpRequestExecutor.java:66)

  at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)

  at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)

  at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)

  at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)

  at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)

  at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)

  at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)

  at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)

  at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)

  at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)

  at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)

  at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:1144)

  at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:1133)

  at org.apache.hadoop.fs.s3a.S3AOutputStream.close(S3AOutputStream.java:142)

  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)

  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)

  at org.apache.flink.runtime.fs.hdfs.HadoopDataOutputStream.close(HadoopDataOutputStream.java:48)

  at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)

  at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:319)

  at org.apache.flink.runtime.checkpoint.AbstractAsyncSnapshotIOCallable.closeStreamAndGetStateHandle(AbstractAsyncSnapshotIOCallable.java:100)

  at org.apache.flink.runtime.state.DefaultOperatorStateBackend$1.performOperation(DefaultOperatorStateBackend.java:270)

  at org.apache.flink.runtime.state.DefaultOperatorStateBackend$1.performOperation(DefaultOperatorStateBackend.java:233)

  at org.apache.flink.runtime.io.async.AbstractAsyncIOCallable.call(AbstractAsyncIOCallable.java:72)

  at java.util.concurrent.FutureTask.run(FutureTask.java:266)

  at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40)

  at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:906)

  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

  at java.util.concurrent.FutureTask.run(FutureTask.java:266)

  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

  at java.lang.Thread.run(Thread.java:748)

2017-11-16 19:16:35,816 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to fail task externally Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c).

2017-11-16 19:16:35,819 WARN  Remoting                                                      - Tried to associate with unreachable remote address [akka.tcp://flink@fps-flink-jobmanager:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has a UID that has been quarantined. Association aborted.]

2017-11-16 19:16:45,817 INFO  org.apache.flink.runtime.taskmanager.Task                     - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c) switched from RUNNING to FAILED.

java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager: JobManager is no longer reachable

  at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1095)

  at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:311)

  at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

  at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)

  at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

  at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

  at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

  at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)

  at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

  at akka.actor.Actor$class.aroundReceive(Actor.scala:467)

  at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:120)

  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

  at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)

  at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)

  at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)

  at akka.actor.ActorCell.invoke(ActorCell.scala:486)

  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)

  at akka.dispatch.Mailbox.run(Mailbox.scala:220)

  at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)

  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2017-11-16 19:16:55,815 INFO  org.apache.flink.runtime.state.DefaultOperatorStateBackend    - DefaultOperatorStateBackend snapshot (File Stream Factory @ s3a://zendesk-usw2-fraud-prevention-production/checkpoints/3de8f35a1af689237e3c5c94023aba3f, asynchronous part) in thread Thread[pool-7-thread-647,5,Flink Task Threads] took 202637 ms.

2017-11-16 19:16:55,816 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c).

2017-11-16 19:17:00,825 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Disassociating from JobManager

2017-11-16 19:17:00,825 INFO  org.apache.flink.runtime.blob.BlobCache                       - Shutting down BlobCache

2017-11-16 19:17:13,677 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)

2017-11-16 19:17:13,687 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Successful registration at JobManager (akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager), starting network stack and library cache.

2017-11-16 19:17:13,687 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Determined BLOB server address to be fps-flink-jobmanager/10.231.37.240:6124. Starting BLOB cache.

2017-11-16 19:17:13,687 INFO  org.apache.flink.runtime.blob.BlobCache                       - Created BLOB cache storage directory /tmp/blobStore-133033d7-a3c1-4ffb-b360-3d0efdc355aa

2017-11-16 19:17:31,097 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)' did not react to cancelling signal, but is stuck in method:

 java.util.regex.Pattern.matcher(Pattern.java:1093)

java.lang.String.replace(String.java:2239)

com.trueaccord.scalapb.textformat.TextFormatUtils$.escapeDoubleQuotesAndBackslashes(TextFormatUtils.scala:160)

com.trueaccord.scalapb.textformat.TextGenerator.addMaybeEscape(TextGenerator.scala:29)

com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:74)

com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:41)

com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)

com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)

com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)

scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)

com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)

com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)

com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)

com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)

scala.collection.Iterator$class.foreach(Iterator.scala:742)

scala.collection.AbstractIterator.foreach(Iterator.scala:1194)

scala.collection.IterableLike$class.foreach(IterableLike.scala:72)

scala.collection.AbstractIterable.foreach(Iterable.scala:54)

com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:25)

com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)

com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)

scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:8)
com.trueaccord.scalapb.textformat.Printer$.printToString(Printer.scala:19)
com.trueaccord.scalapb.TextFormat$.printToUnicodeString(TextFormat.scala:34)
com.zendesk.fraud_prevention.data_types.MaxwellEvent.toString(MaxwellEvent.scala:212)
java.lang.String.valueOf(String.java:2994)
scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:362)
scala.collection.immutable.List.foreach(List.scala:381)
scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:355)
scala.collection.AbstractTraversable.addString(Traversable.scala:104)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:321)
scala.collection.AbstractTraversable.mkString(Traversable.scala:104)
scala.collection.TraversableLike$class.toString(TraversableLike.scala:645)
scala.collection.SeqLike$class.toString(SeqLike.scala:682)
scala.collection.AbstractSeq.toString(Seq.scala:41)
java.lang.String.valueOf(String.java:2994)
java.lang.StringBuilder.append(StringBuilder.java:131)
scala.StringContext.standardInterpolator(StringContext.scala:125)
scala.StringContext.s(StringContext.scala:95)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:12)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:10)
org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:304)
org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:393)
org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:233)
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.emitRecord(Kafka09Fetcher.java:193)
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:152)
org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:483)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
java.lang.Thread.run(Thread.java:748)
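[Editor's note] The trace above shows the task thread deep inside MaxwellEvent.toString (ScalaPB text formatting), reached via scala.StringContext.s, i.e. a string interpolation in MaxwellFilterFlatMapper.flatMap that renders every record. Because s"..." builds its string eagerly, the full protobuf text rendering runs on the record hot path even when the string is only needed for logging. The sketch below illustrates that difference with dependency-free stand-ins; the Event class, the renders counter, and lazyLog are hypothetical, not the actual Zendesk code:

```scala
object LazyLogDemo {
  // Counts how many times toString actually runs.
  var renders = 0

  // Stand-in for a large ScalaPB message whose toString walks the whole tree.
  final case class Event(id: Int) {
    override def toString: String = { renders += 1; s"Event($id)" }
  }

  // Lazy variant: the by-name parameter defers building the message
  // until we know logging is actually enabled.
  def lazyLog(enabled: Boolean, msg: => String): Unit =
    if (enabled) println(msg)

  def main(args: Array[String]): Unit = {
    val e = Event(1)

    lazyLog(enabled = false, s"got event: $e") // by-name: toString never runs
    assert(renders == 0)

    val eager = s"got event: $e" // eager: toString runs unconditionally
    assert(renders == 1)
  }
}
```

Most logging frontends (e.g. scala-logging's macros or SLF4J's `{}` placeholders) give the same deferral for free, so guarding or deferring the interpolation keeps per-record cost bounded.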



2017-11-16 19:17:31,106 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)' did not react to cancelling signal, but is stuck in method:

com.trueaccord.scalapb.textformat.TextGenerator.add(TextGenerator.scala:19)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:33)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
scala.collection.Iterator$class.foreach(Iterator.scala:742)
scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
scala.collection.AbstractIterable.foreach(Iterable.scala:54)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:25)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:8)
com.trueaccord.scalapb.textformat.Printer$.printToString(Printer.scala:19)
com.trueaccord.scalapb.TextFormat$.printToUnicodeString(TextFormat.scala:34)
com.zendesk.fraud_prevention.data_types.MaxwellEvent.toString(MaxwellEvent.scala:212)
java.lang.String.valueOf(String.java:2994)
scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:362)
scala.collection.immutable.List.foreach(List.scala:381)
scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:355)
scala.collection.AbstractTraversable.addString(Traversable.scala:104)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:321)
scala.collection.AbstractTraversable.mkString(Traversable.scala:104)
scala.collection.TraversableLike$class.toString(TraversableLike.scala:645)
scala.collection.SeqLike$class.toString(SeqLike.scala:682)
scala.collection.AbstractSeq.toString(Seq.scala:41)
java.lang.String.valueOf(String.java:2994)
java.lang.StringBuilder.append(StringBuilder.java:131)
scala.StringContext.standardInterpolator(StringContext.scala:125)
scala.StringContext.s(StringContext.scala:95)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:12)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:10)
org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:304)
org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:393)
org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:233)
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.emitRecord(Kafka09Fetcher.java:193)
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:152)
org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:483)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
java.lang.Thread.run(Thread.java:748)



2017-11-16 19:18:05,486 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)' did not react to cancelling signal, but is stuck in method:

java.lang.String.replace(String.java:2240)
com.trueaccord.scalapb.textformat.TextGenerator.addMaybeEscape(TextGenerator.scala:29)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:74)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:41)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
scala.collection.Iterator$class.foreach(Iterator.scala:742)
scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
scala.collection.AbstractIterable.foreach(Iterable.scala:54)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:25)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:8)
com.trueaccord.scalapb.textformat.Printer$.printToString(Printer.scala:19)
com.trueaccord.scalapb.TextFormat$.printToUnicodeString(TextFormat.scala:34)
com.zendesk.fraud_prevention.data_types.MaxwellEvent.toString(MaxwellEvent.scala:212)
java.lang.String.valueOf(String.java:2994)
java.lang.StringBuilder.append(StringBuilder.java:131)
scala.StringContext.standardInterpolator(StringContext.scala:125)
scala.StringContext.s(StringContext.scala:95)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper$$anonfun$flatMap$1.apply(MaxwellFilterFlatMapper.scala:14)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper$$anonfun$flatMap$1.apply(MaxwellFilterFlatMapper.scala:13)
scala.collection.immutable.List.foreach(List.scala:381)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:13)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:10)
org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:304)
org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:393)
org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:233)
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.emitRecord(Kafka09Fetcher.java:193)
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:152)
org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:483)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
java.lang.Thread.run(Thread.java:748)



2017-11-16 19:18:34,375 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c).

2017-11-16 19:18:36,257 INFO  org.apache.flink.runtime.taskmanager.Task                     - Ensuring all FileSystem streams are closed for task Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c) [FAILED]

2017-11-16 19:18:36,618 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Cannot find task with ID 484ebabbd13dce5e8503d88005bcdb6c to unregister.
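[Editor's note] Taken together, the logs suggest the source task was pinned in user code (the ScalaPB toString rendering above) and therefore could not honor the cancel signal; a TaskManager whose JVM is starved in this way (e.g. by GC pressure or CPU saturation from the hot toString path) can also delay Akka heartbeats until the JobManager's death watch marks it unreachable, which would match the reconnect-then-drop cycle better than raw network congestion. While the logging hot spot is being fixed, the death-watch tolerances can be loosened in flink-conf.yaml. The option names below come from the Flink 1.3 configuration reference; the values are illustrative only and should be tuned per environment:

```yaml
# Illustrative values only -- defaults differ; tune for your cluster.
akka.watch.heartbeat.interval: 10 s
akka.watch.heartbeat.pause: 120 s   # tolerate longer heartbeat gaps before quarantining a TM
akka.ask.timeout: 60 s
```

Raising these only hides the symptom (slower failure detection); the durable fix is removing the per-record text rendering from the hot path.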