You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by 陈梓立 <wa...@gmail.com> on 2018/08/02 07:14:32 UTC

[HELP] Bisect

Hi devs,

At tag release-1.5.1, I find LOG contains noise like below.

```
13:58:40,752 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Job
a18a72a9c335b5cc964f43075418dedc reached globally terminal state FINISHED.
13:58:40,753 INFO  org.apache.flink.runtime.minicluster.MiniCluster
      - Shutting down Flink Mini Cluster
13:58:40,753 INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Shutting
down rest endpoint.
13:58:40,754 INFO  org.apache.flink.runtime.jobmaster.JobMaster
      - Stopping the JobMaster for job Flink Java Job at Thu Aug 02
13:58:29 CST 2018(a18a72a9c335b5cc964f43075418dedc).
13:58:40,754 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor
      - Stopping TaskExecutor akka://flink/user/taskmanager_0.
13:58:40,755 INFO
org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager  -
Shutting down TaskExecutorLocalStateStoresManager.
13:58:40,755 INFO  org.apache.flink.runtime.jobmaster.JobMaster
      - Close ResourceManager connection 58178cf2968f728837df335f4a8a63a5:
JobManager is shutting down..
13:58:40,755 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Stopping
dispatcher akka://flink/user/dispatcher65cb755a-d441-48b5-8942-52465b77bae2.
13:58:40,755 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Stopping
all currently running jobs of dispatcher
akka://flink/user/dispatcher65cb755a-d441-48b5-8942-52465b77bae2.
13:58:40,756 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool
      - Suspending SlotPool.
13:58:40,756 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool
      - Stopping SlotPool.
13:58:40,758 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
Disconnect job manager
93c37c8311c692f1a5b0c7caaa074a3a@akka://flink/user/jobmanager_1
for job a18a72a9c335b5cc964f43075418dedc from the resource manager.
13:58:40,758 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Closing
the SlotManager.
13:58:40,758 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  -
Suspending the SlotManager.
13:58:40,759 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  -
Unregister TaskManager 8c28415e2960d2f7a49a12cf329e2444 from the
SlotManager.
13:58:40,759 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor
      - JobManager for job a18a72a9c335b5cc964f43075418dedc with leader id
93c37c8311c692f1a5b0c7caaa074a3a lost leadership.
13:58:40,759 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Slot
request with allocation id a3f91d383ba2e647c07894a0974af2fb for job
a18a72a9c335b5cc964f43075418dedc failed.
org.apache.flink.util.FlinkException: The assigned slot
2cd8d254-283d-472c-8ae2-6499e4ba04bf_7 was removed.
at
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:895)
...
13:58:40,761 INFO
org.apache.flink.runtime.rest.handler.legacy.backpressure.StackTraceSampleCoordinator
- Shutting down stack trace sample coordinator.
13:58:40,763 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Slot
request with allocation id 25f5cd03e1bd5654c54c20143fd05a7e for job
a18a72a9c335b5cc964f43075418dedc failed.
org.apache.flink.util.FlinkException: The assigned slot
2cd8d254-283d-472c-8ae2-6499e4ba04bf_6 was removed.
at
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:895)
...
13:58:40,766 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Stopped
dispatcher akka://flink/user/dispatcher65cb755a-d441-48b5-8942-52465b77bae2.
13:58:40,766 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Slot
request with allocation id 32047761d6066628b196ac13b7b2993e for job
a18a72a9c335b5cc964f43075418dedc failed.
org.apache.flink.util.FlinkException: The assigned slot
2cd8d254-283d-472c-8ae2-6499e4ba04bf_5 was removed.
at
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:895)
...

(repeat likely)
13:58:40,768 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Slot
request with allocation id befc46831f1f0ec5ef54cd660d12aa88 for job
a18a72a9c335b5cc964f43075418dedc failed.
org.apache.flink.util.FlinkException: The assigned slot
2cd8d254-283d-472c-8ae2-6499e4ba04bf_1 was removed.
at
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:895)
...
13:58:40,784 INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Removing
cache directory
/var/folders/ff/92vml6qd147dncsvvw28qlrc0000gn/T/flink-web-ui
13:58:40,785 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager
      - I/O manager removed spill file directory
/var/folders/ff/92vml6qd147dncsvvw28qlrc0000gn/T/flink-io-22be945e-daeb-49e9-ab3b-22521e9d2280
13:58:40,785 INFO  org.apache.flink.runtime.io.network.NetworkEnvironment
      - Shutting down the network environment and its components.
13:58:40,796 INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService
      - Stop job leader service.
13:58:40,796 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor
      - Stopped TaskExecutor akka://flink/user/taskmanager_0.
13:58:40,804 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService
      - Stopping Akka RPC service.
13:58:40,834 INFO  org.apache.flink.runtime.blob.PermanentBlobCache
      - Shutting down BLOB cache
13:58:40,834 INFO  org.apache.flink.runtime.blob.TransientBlobCache
      - Shutting down BLOB cache
13:58:40,836 INFO  org.apache.flink.runtime.blob.BlobServer
      - Stopped BLOB server at 0.0.0.0:50685
13:58:40,836 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService
      - Stopped Akka RPC service.
```

Note that the job is FINISHED and RM, TM, JM and so on shutdown correctly.

To reproduce it, run
`flink-examples/flink-examples-batch/src/main/java/org/apache/flink/examples/java/ml/LinearRegression.java`

The code snip relevant is here
<https://github.com/apache/flink/blob/628b71dfabe72a99eb6d54994fc870abed1f0268/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/SlotManager.java#L783>
.

But it could not be reproduced in master, which infers there is a commit
suppress it. So I come to here ask for help to accurately bisect it.

Thanks,
tison

Re: [HELP] Bisect

Posted by 陈梓立 <wa...@gmail.com>.
I think I find it at
https://github.com/apache/flink/commit/4e616a8362744c15a71e0e57ad68fbb52266c837

Thanks,
tison

陈梓立 <wa...@gmail.com> 于2018年8月2日周四 下午3:14写道:

> Hi devs,
>
> At tag release-1.5.1, I find LOG contains noise like below.
>
> ```
> 13:58:40,752 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Job
> a18a72a9c335b5cc964f43075418dedc reached globally terminal state FINISHED.
> 13:58:40,753 INFO  org.apache.flink.runtime.minicluster.MiniCluster
>       - Shutting down Flink Mini Cluster
> 13:58:40,753 INFO
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Shutting
> down rest endpoint.
> 13:58:40,754 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>       - Stopping the JobMaster for job Flink Java Job at Thu Aug 02
> 13:58:29 CST 2018(a18a72a9c335b5cc964f43075418dedc).
> 13:58:40,754 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor
>       - Stopping TaskExecutor akka://flink/user/taskmanager_0.
> 13:58:40,755 INFO
> org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager  -
> Shutting down TaskExecutorLocalStateStoresManager.
> 13:58:40,755 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>       - Close ResourceManager connection 58178cf2968f728837df335f4a8a63a5:
> JobManager is shutting down..
> 13:58:40,755 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Stopping
> dispatcher akka://flink/user/dispatcher65cb755a-d441-48b5-8942-52465b77bae2.
> 13:58:40,755 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Stopping
> all currently running jobs of dispatcher
> akka://flink/user/dispatcher65cb755a-d441-48b5-8942-52465b77bae2.
> 13:58:40,756 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool
>       - Suspending SlotPool.
> 13:58:40,756 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool
>       - Stopping SlotPool.
> 13:58:40,758 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
> Disconnect job manager 93c37c8311c692f1a5b0c7caaa074a3a@akka://flink/user/jobmanager_1
> for job a18a72a9c335b5cc964f43075418dedc from the resource manager.
> 13:58:40,758 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Closing
> the SlotManager.
> 13:58:40,758 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  -
> Suspending the SlotManager.
> 13:58:40,759 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  -
> Unregister TaskManager 8c28415e2960d2f7a49a12cf329e2444 from the
> SlotManager.
> 13:58:40,759 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor
>       - JobManager for job a18a72a9c335b5cc964f43075418dedc with leader id
> 93c37c8311c692f1a5b0c7caaa074a3a lost leadership.
> 13:58:40,759 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Slot
> request with allocation id a3f91d383ba2e647c07894a0974af2fb for job
> a18a72a9c335b5cc964f43075418dedc failed.
> org.apache.flink.util.FlinkException: The assigned slot
> 2cd8d254-283d-472c-8ae2-6499e4ba04bf_7 was removed.
> at
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:895)
> ...
> 13:58:40,761 INFO
> org.apache.flink.runtime.rest.handler.legacy.backpressure.StackTraceSampleCoordinator
> - Shutting down stack trace sample coordinator.
> 13:58:40,763 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Slot
> request with allocation id 25f5cd03e1bd5654c54c20143fd05a7e for job
> a18a72a9c335b5cc964f43075418dedc failed.
> org.apache.flink.util.FlinkException: The assigned slot
> 2cd8d254-283d-472c-8ae2-6499e4ba04bf_6 was removed.
> at
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:895)
> ...
> 13:58:40,766 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Stopped
> dispatcher akka://flink/user/dispatcher65cb755a-d441-48b5-8942-52465b77bae2.
> 13:58:40,766 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Slot
> request with allocation id 32047761d6066628b196ac13b7b2993e for job
> a18a72a9c335b5cc964f43075418dedc failed.
> org.apache.flink.util.FlinkException: The assigned slot
> 2cd8d254-283d-472c-8ae2-6499e4ba04bf_5 was removed.
> at
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:895)
> ...
>
> (repeat likely)
> 13:58:40,768 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Slot
> request with allocation id befc46831f1f0ec5ef54cd660d12aa88 for job
> a18a72a9c335b5cc964f43075418dedc failed.
> org.apache.flink.util.FlinkException: The assigned slot
> 2cd8d254-283d-472c-8ae2-6499e4ba04bf_1 was removed.
> at
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:895)
> ...
> 13:58:40,784 INFO
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Removing
> cache directory
> /var/folders/ff/92vml6qd147dncsvvw28qlrc0000gn/T/flink-web-ui
> 13:58:40,785 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager
>       - I/O manager removed spill file directory
> /var/folders/ff/92vml6qd147dncsvvw28qlrc0000gn/T/flink-io-22be945e-daeb-49e9-ab3b-22521e9d2280
> 13:58:40,785 INFO  org.apache.flink.runtime.io.network.NetworkEnvironment
>       - Shutting down the network environment and its components.
> 13:58:40,796 INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService
>       - Stop job leader service.
> 13:58:40,796 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor
>       - Stopped TaskExecutor akka://flink/user/taskmanager_0.
> 13:58:40,804 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService
>       - Stopping Akka RPC service.
> 13:58:40,834 INFO  org.apache.flink.runtime.blob.PermanentBlobCache
>       - Shutting down BLOB cache
> 13:58:40,834 INFO  org.apache.flink.runtime.blob.TransientBlobCache
>       - Shutting down BLOB cache
> 13:58:40,836 INFO  org.apache.flink.runtime.blob.BlobServer
>       - Stopped BLOB server at 0.0.0.0:50685
> 13:58:40,836 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService
>       - Stopped Akka RPC service.
> ```
>
> Note that the job is FINISHED and RM, TM, JM and so on shutdown correctly.
>
> To reproduce it, run
> `flink-examples/flink-examples-batch/src/main/java/org/apache/flink/examples/java/ml/LinearRegression.java`
>
> The code snip relevant is here
> <https://github.com/apache/flink/blob/628b71dfabe72a99eb6d54994fc870abed1f0268/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/SlotManager.java#L783>
> .
>
> But it could not be reproduced in master, which infers there is a commit
> suppress it. So I come to here ask for help to accurately bisect it.
>
> Thanks,
> tison
>