You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by "Stephan Ewen (JIRA)" <ji...@apache.org> on 2016/01/19 14:46:39 UTC
[jira] [Created] (FLINK-3260) ExecutionGraph gets stuck in state
FAILING
Stephan Ewen created FLINK-3260:
-----------------------------------
Summary: ExecutionGraph gets stuck in state FAILING
Key: FLINK-3260
URL: https://issues.apache.org/jira/browse/FLINK-3260
Project: Flink
Issue Type: Bug
Components: JobManager
Affects Versions: 0.10.1
Reporter: Stephan Ewen
Priority: Blocker
Fix For: 1.0.0
It is a bit of a rare case, but the following can currently happen:
1. Jobs runs for a while, some tasks are already finished.
2. Job fails, goes to state failing and restarting. Non-finished tasks fail or are canceled.
3. For the finished tasks, ask-futures from certain messages (for example for releasing intermediate result partitions) can fail (timeout) and cause the execution to go from FINISHED to FAILED
4. This triggers the execution graph to go to FAILING without ever going further into RESTARTING again
5. The job is stuck
It initially looks like this is mainly an issue for batch jobs (jobs where tasks do finish, rather than run infinitely).
The log that shows how this manifests:
{code}
--------------------------------------------------------------------------------
17:19:19,782 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
17:19:19,844 INFO Remoting - Starting remoting
17:19:20,065 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink@127.0.0.1:56722]
17:19:20,090 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /tmp/blobStore-6766f51a-1c51-4a03-acfb-08c2c29c11f0
17:19:20,096 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:43327 - max concurrent requests: 50 - max backlog: 1000
17:19:20,113 INFO org.apache.flink.runtime.jobmanager.MemoryArchivist - Started memory archivist akka://flink/user/archive
17:19:20,115 INFO org.apache.flink.runtime.checkpoint.SavepointStoreFactory - No savepoint state backend configured. Using job manager savepoint state backend.
17:19:20,118 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka.tcp://flink@127.0.0.1:56722/user/jobmanager.
17:19:20,123 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@127.0.0.1:56722/user/jobmanager was granted leadership with leader session ID None.
17:19:25,605 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at testing-worker-linux-docker-e6d6931f-3200-linux-4 (akka.tcp://flink@172.17.0.253:43702/user/taskmanager) as f213232054587f296a12140d56f63ed1. Current number of registered hosts is 1. Current number of alive task slots is 2.
17:19:26,758 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at testing-worker-linux-docker-e6d6931f-3200-linux-4 (akka.tcp://flink@172.17.0.253:43956/user/taskmanager) as f9e78baa14fb38c69517fb1bcf4f419c. Current number of registered hosts is 2. Current number of alive task slots is 4.
17:19:27,064 INFO org.apache.flink.api.java.ExecutionEnvironment - The job has 0 registered types and 0 default Kryo serializers
17:19:27,071 INFO org.apache.flink.client.program.Client - Starting client actor system
17:19:27,072 INFO org.apache.flink.runtime.client.JobClient - Starting JobClient actor system
17:19:27,110 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
17:19:27,121 INFO Remoting - Starting remoting
17:19:27,143 INFO org.apache.flink.runtime.client.JobClient - Started JobClient actor system at 127.0.0.1:51198
17:19:27,145 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink@127.0.0.1:51198]
17:19:27,325 INFO org.apache.flink.runtime.client.JobClientActor - Disconnect from JobManager null.
17:19:27,362 INFO org.apache.flink.runtime.client.JobClientActor - Received job Flink Java Job at Mon Jan 18 17:19:27 UTC 2016 (fa05fd25993a8742da09cc5023c1e38d).
17:19:27,362 INFO org.apache.flink.runtime.client.JobClientActor - Could not submit job Flink Java Job at Mon Jan 18 17:19:27 UTC 2016 (fa05fd25993a8742da09cc5023c1e38d), because there is no connection to a JobManager.
17:19:27,379 INFO org.apache.flink.runtime.client.JobClientActor - Connect to JobManager Actor[akka.tcp://flink@127.0.0.1:56722/user/jobmanager#-1489998809].
17:19:27,379 INFO org.apache.flink.runtime.client.JobClientActor - Connected to new JobManager akka.tcp://flink@127.0.0.1:56722/user/jobmanager.
17:19:27,379 INFO org.apache.flink.runtime.client.JobClientActor - Sending message to JobManager akka.tcp://flink@127.0.0.1:56722/user/jobmanager to submit job Flink Java Job at Mon Jan 18 17:19:27 UTC 2016 (fa05fd25993a8742da09cc5023c1e38d) and wait for progress
17:19:27,380 INFO org.apache.flink.runtime.client.JobClientActor - Upload jar files to job manager akka.tcp://flink@127.0.0.1:56722/user/jobmanager.
17:19:27,380 INFO org.apache.flink.runtime.client.JobClientActor - Submit job to the job manager akka.tcp://flink@127.0.0.1:56722/user/jobmanager.
17:19:27,453 INFO org.apache.flink.runtime.jobmanager.JobManager - Submitting job fa05fd25993a8742da09cc5023c1e38d (Flink Java Job at Mon Jan 18 17:19:27 UTC 2016).
17:19:27,591 INFO org.apache.flink.runtime.jobmanager.JobManager - Scheduling job fa05fd25993a8742da09cc5023c1e38d (Flink Java Job at Mon Jan 18 17:19:27 UTC 2016).
17:19:27,592 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (1/4) (c79bf4381462c690f5999f2d1949ab50) switched from CREATED to SCHEDULED
17:19:27,596 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (1/4) (c79bf4381462c690f5999f2d1949ab50) switched from SCHEDULED to DEPLOYING
17:19:27,597 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (1/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4
17:19:27,606 INFO org.apache.flink.runtime.client.JobClientActor - Job was successfully submitted to the JobManager akka.tcp://flink@127.0.0.1:56722/user/jobmanager.
17:19:27,630 INFO org.apache.flink.runtime.jobmanager.JobManager - Status of job fa05fd25993a8742da09cc5023c1e38d (Flink Java Job at Mon Jan 18 17:19:27 UTC 2016) changed to RUNNING.
17:19:27,637 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (2/4) (e73af91028cb76f7d3cd887cb6d66755) switched from CREATED to SCHEDULED
17:19:27,654 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 Job execution switched to status RUNNING.
17:19:27,655 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(1/4) switched to SCHEDULED
17:19:27,656 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(1/4) switched to DEPLOYING
17:19:27,666 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (2/4) (e73af91028cb76f7d3cd887cb6d66755) switched from SCHEDULED to DEPLOYING
17:19:27,667 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (2/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4
17:19:27,667 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(2/4) switched to SCHEDULED
17:19:27,669 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(2/4) switched to DEPLOYING
17:19:27,681 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (3/4) (807daf978da9dc347dca930822c78f8f) switched from CREATED to SCHEDULED
17:19:27,682 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (3/4) (807daf978da9dc347dca930822c78f8f) switched from SCHEDULED to DEPLOYING
17:19:27,682 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (3/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4
17:19:27,682 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (4/4) (ba45c37065b67fc8f5005a50d0e88fff) switched from CREATED to SCHEDULED
17:19:27,682 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (4/4) (ba45c37065b67fc8f5005a50d0e88fff) switched from SCHEDULED to DEPLOYING
17:19:27,685 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (4/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4
17:19:27,686 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(3/4) switched to SCHEDULED
17:19:27,687 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(3/4) switched to DEPLOYING
17:19:27,687 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(4/4) switched to SCHEDULED
17:19:27,692 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(4/4) switched to DEPLOYING
17:19:27,833 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (4/4) (ba45c37065b67fc8f5005a50d0e88fff) switched from DEPLOYING to RUNNING
17:19:27,839 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(4/4) switched to RUNNING
17:19:27,840 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (2/4) (e73af91028cb76f7d3cd887cb6d66755) switched from DEPLOYING to RUNNING
17:19:27,852 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(2/4) switched to RUNNING
17:19:27,896 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (1/4) (c79bf4381462c690f5999f2d1949ab50) switched from DEPLOYING to RUNNING
17:19:27,898 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (3/4) (807daf978da9dc347dca930822c78f8f) switched from DEPLOYING to RUNNING
17:19:27,901 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(1/4) switched to RUNNING
17:19:27,905 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(3/4) switched to RUNNING
17:19:28,114 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (3/4) (7997918330ecf2610b3298a8c8ef2852) switched from CREATED to SCHEDULED
17:19:28,126 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/4) (6421c8f88b191ea844619a40a523773b) switched from CREATED to SCHEDULED
17:19:28,134 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/4) (6421c8f88b191ea844619a40a523773b) switched from SCHEDULED to DEPLOYING
17:19:28,134 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4
17:19:28,126 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (2/4) (d0d011dc0a0823bcec5a57a369b334ed) switched from CREATED to SCHEDULED
17:19:28,139 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (2/4) (d0d011dc0a0823bcec5a57a369b334ed) switched from SCHEDULED to DEPLOYING
17:19:28,139 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (2/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4
17:19:28,117 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (4/4) (c928d19f73d700e80cdfad650689febb) switched from CREATED to SCHEDULED
17:19:28,134 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (3/4) (7997918330ecf2610b3298a8c8ef2852) switched from SCHEDULED to DEPLOYING
17:19:28,140 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (3/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4
17:19:28,140 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (4/4) (c928d19f73d700e80cdfad650689febb) switched from SCHEDULED to DEPLOYING
17:19:28,141 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (4/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4
17:19:28,147 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(3/4) switched to SCHEDULED
17:19:28,153 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/4) switched to SCHEDULED
17:19:28,153 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/4) switched to DEPLOYING
17:19:28,153 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(2/4) switched to SCHEDULED
17:19:28,153 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(2/4) switched to DEPLOYING
17:19:28,156 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(3/4) switched to DEPLOYING
17:19:28,158 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(4/4) switched to SCHEDULED
17:19:28,165 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(4/4) switched to DEPLOYING
17:19:28,238 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (2/4) (e73af91028cb76f7d3cd887cb6d66755) switched from RUNNING to FINISHED
17:19:28,242 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(2/4) switched to FINISHED
17:19:28,308 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (3/4) (807daf978da9dc347dca930822c78f8f) switched from RUNNING to FINISHED
17:19:28,315 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (1/4) (c79bf4381462c690f5999f2d1949ab50) switched from RUNNING to FINISHED
17:19:28,317 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(3/4) switched to FINISHED
17:19:28,318 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(1/4) switched to FINISHED
17:19:28,328 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/4) (6421c8f88b191ea844619a40a523773b) switched from DEPLOYING to RUNNING
17:19:28,336 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/4) switched to RUNNING
17:19:28,338 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (3/4) (7997918330ecf2610b3298a8c8ef2852) switched from DEPLOYING to RUNNING
17:19:28,341 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(3/4) switched to RUNNING
17:19:28,459 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (4/4) (ba45c37065b67fc8f5005a50d0e88fff) switched from RUNNING to FINISHED
17:19:28,463 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(4/4) switched to FINISHED
17:19:28,520 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (4/4) (c928d19f73d700e80cdfad650689febb) switched from DEPLOYING to RUNNING
17:19:28,529 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(4/4) switched to RUNNING
17:19:28,540 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (2/4) (d0d011dc0a0823bcec5a57a369b334ed) switched from DEPLOYING to RUNNING
17:19:28,545 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(2/4) switched to RUNNING
17:19:32,384 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at testing-worker-linux-docker-e6d6931f-3200-linux-4 (akka.tcp://flink@172.17.0.253:60852/user/taskmanager) as 5848d44035a164a0302da6c8701ff748. Current number of registered hosts is 3. Current number of alive task slots is 6.
17:19:32,598 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/1) (d0f8f69f9047c3154b860850955de20f) switched from CREATED to SCHEDULED
17:19:32,598 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/1) (d0f8f69f9047c3154b860850955de20f) switched from SCHEDULED to DEPLOYING
17:19:32,598 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/1) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4
17:19:32,605 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:32 Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/1) switched to SCHEDULED
17:19:32,605 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:32 Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/1) switched to DEPLOYING
17:19:32,611 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (4/4) (c928d19f73d700e80cdfad650689febb) switched from RUNNING to FINISHED
17:19:32,614 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:32 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(4/4) switched to FINISHED
17:19:32,717 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/4) (6421c8f88b191ea844619a40a523773b) switched from RUNNING to FINISHED
17:19:32,719 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:32 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/4) switched to FINISHED
17:19:32,724 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/1) (d0f8f69f9047c3154b860850955de20f) switched from DEPLOYING to RUNNING
17:19:32,726 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:32 Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/1) switched to RUNNING
17:19:32,843 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (2/4) (d0d011dc0a0823bcec5a57a369b334ed) switched from RUNNING to FINISHED
17:19:32,845 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:32 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(2/4) switched to FINISHED
17:19:33,092 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@172.17.0.253:43702] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
17:19:39,111 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@172.17.0.253:43702]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.17.0.253:43702
17:19:39,113 INFO org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://flink@172.17.0.253:43702/user/taskmanager terminated.
17:19:39,114 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (3/4) (7997918330ecf2610b3298a8c8ef2852) switched from RUNNING to FAILED
17:19:39,120 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:39 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(3/4) switched to FAILED
java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager f213232054587f296a12140d56f63ed1 @ testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: akka.tcp://flink@172.17.0.253:43702/user/taskmanager
at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)
at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:792)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
at akka.actor.ActorCell.invoke(ActorCell.scala:486)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
17:19:39,129 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/1) (d0f8f69f9047c3154b860850955de20f) switched from RUNNING to CANCELING
17:19:39,132 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSink (collect()) (1/1) (895e1ea552281a665ae390c966cdb3b7) switched from CREATED to CANCELED
17:19:39,149 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:39 Job execution switched to status FAILING.
java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager f213232054587f296a12140d56f63ed1 @ testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: akka.tcp://flink@172.17.0.253:43702/user/taskmanager
at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)
at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:792)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
at akka.actor.ActorCell.invoke(ActorCell.scala:486)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
17:19:39,173 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:39 Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/1) switched to CANCELING
17:19:39,173 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:39 DataSink (collect())(1/1) switched to CANCELED
17:19:39,174 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/1) (d0f8f69f9047c3154b860850955de20f) switched from CANCELING to FAILED
17:19:39,177 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:39 Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/1) switched to FAILED
java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager f213232054587f296a12140d56f63ed1 @ testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: akka.tcp://flink@172.17.0.253:43702/user/taskmanager
at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)
at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:792)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
at akka.actor.ActorCell.invoke(ActorCell.scala:486)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
17:19:39,179 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:39 Job execution switched to status RESTARTING.
17:19:39,179 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Delaying retry of job execution for 10000 ms ...
17:19:39,179 INFO org.apache.flink.runtime.instance.InstanceManager - Unregistered task manager akka.tcp://flink@172.17.0.253:43702/user/taskmanager. Number of registered task managers 2. Number of available slots 4.
17:19:39,179 INFO org.apache.flink.runtime.jobmanager.JobManager - Status of job fa05fd25993a8742da09cc5023c1e38d (Flink Java Job at Mon Jan 18 17:19:27 UTC 2016) changed to FAILING.
java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager f213232054587f296a12140d56f63ed1 @ testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: akka.tcp://flink@172.17.0.253:43702/user/taskmanager
at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)
at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:792)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
at akka.actor.ActorCell.invoke(ActorCell.scala:486)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
17:19:39,180 INFO org.apache.flink.runtime.jobmanager.JobManager - Status of job fa05fd25993a8742da09cc5023c1e38d (Flink Java Job at Mon Jan 18 17:19:27 UTC 2016) changed to RESTARTING.
17:19:42,766 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (2/4) (d0d011dc0a0823bcec5a57a369b334ed) switched from FINISHED to FAILED
17:19:42,773 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:42 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(2/4) switched to FAILED
java.lang.IllegalStateException: Update task on instance f213232054587f296a12140d56f63ed1 @ testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: akka.tcp://flink@172.17.0.253:43702/user/taskmanager failed due to:
at org.apache.flink.runtime.executiongraph.Execution$5.onFailure(Execution.java:915)
at akka.dispatch.OnFailure.internal(Future.scala:228)
at akka.dispatch.OnFailure.internal(Future.scala:227)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:25)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@172.17.0.253:43702/user/taskmanager#-1712955384]] after [10000 ms]
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
at java.lang.Thread.run(Thread.java:745)
17:19:42,774 INFO org.apache.flink.runtime.jobmanager.JobManager - Status of job fa05fd25993a8742da09cc5023c1e38d (Flink Java Job at Mon Jan 18 17:19:27 UTC 2016) changed to FAILING.
java.lang.IllegalStateException: Update task on instance f213232054587f296a12140d56f63ed1 @ testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: akka.tcp://flink@172.17.0.253:43702/user/taskmanager failed due to:
at org.apache.flink.runtime.executiongraph.Execution$5.onFailure(Execution.java:915)
at akka.dispatch.OnFailure.internal(Future.scala:228)
at akka.dispatch.OnFailure.internal(Future.scala:227)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:25)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@172.17.0.253:43702/user/taskmanager#-1712955384]] after [10000 ms]
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
at java.lang.Thread.run(Thread.java:745)
17:19:42,780 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:42 Job execution switched to status FAILING.
java.lang.IllegalStateException: Update task on instance f213232054587f296a12140d56f63ed1 @ testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: akka.tcp://flink@172.17.0.253:43702/user/taskmanager failed due to:
at org.apache.flink.runtime.executiongraph.Execution$5.onFailure(Execution.java:915)
at akka.dispatch.OnFailure.internal(Future.scala:228)
at akka.dispatch.OnFailure.internal(Future.scala:227)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:25)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@172.17.0.253:43702/user/taskmanager#-1712955384]] after [10000 ms]
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
at java.lang.Thread.run(Thread.java:745)
17:19:49,152 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@172.17.0.253:43702]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.17.0.253:43702
17:19:59,172 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@172.17.0.253:43702]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.17.0.253:43702
17:20:09,191 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@172.17.0.253:43702]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.17.0.253:43702
17:24:32,423 INFO org.apache.flink.runtime.jobmanager.JobManager - Stopping JobManager akka.tcp://flink@127.0.0.1:56722/user/jobmanager.
17:24:32,440 ERROR org.apache.flink.test.recovery.TaskManagerProcessFailureBatchRecoveryITCase -
--------------------------------------------------------------------------------
Test testTaskManagerProcessFailure[0](org.apache.flink.test.recovery.TaskManagerProcessFailureBatchRecoveryITCase) failed with:
java.lang.AssertionError: The program did not finish in time
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertFalse(Assert.java:64)
at org.apache.flink.test.recovery.AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure(AbstractTaskManagerProcessFailureRecoveryTest.java:212)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.junit.runners.Suite.runChild(Suite.java:127)
at org.junit.runners.Suite.runChild(Suite.java:26)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128)
{code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)